Foreword



The Fifth International Conference on Implementation and Application of Automata
(CIAA 2000) was held at the University of Western Ontario in London,
Ontario, Canada on July 24-25, 2000. This conference series was formerly called
the International Workshop on Implementing Automata (WIA).

This volume of the Lecture Notes in Computer Science series contains all the 
papers that were presented at CIAA 2000, and also the abstracts of the poster 
papers that were displayed during the conference. 

The conference addressed issues in automata application and implementa- 
tion. The topics of the papers presented at this conference ranged from automata 
applications in software engineering, natural language and speech recognition, 
and image processing, to new representations and algorithms for efficient imple- 
mentation of automata and related structures. 

Automata theory is one of the oldest areas in computer science. Research in 
automata theory has always been motivated by its applications since its early 
stages of development. In the 1960s and 1970s, automata research was motiva- 
ted heavily by problems arising from compiler construction, circuit design, string 
matching, etc. In recent years, many new applications have been found in various 
areas of computer science as well as in other disciplines. Examples of the new 
applications include statecharts in object-oriented modeling, finite transducers 
in natural language processing, and nondeterministic finite-state models in com- 
munication protocols. Many of the new applications do not and cannot simply 
apply the existing models and algorithms in automata theory to their problems. 
New models, or modifications of the existing models, are needed to satisfy their 
requirements. A feature that can be found in many of the new applications is 
that the sizes of the problems in practice are astronomically larger than those 
occurring in the traditional applications. New algorithms and new representations
of automata are in demand to reduce the time and space requirements of
the computation. 

The CIAA conference series provides a forum for those new problems and new 
challenges. In these conferences, both theoretical and practical results related to 
application and implementation of automata can be presented and discussed, 
and software packages and toolkits can be demonstrated. The participants of 
the conference series have come from both research institutions and industry. 

I wish to thank all the program committee members and referees for their 
efforts in refereeing and selecting papers. This volume is edited with much help 
from Andrei Paun and Mihaela Paun. Their efforts are very much appreciated. 

We wish to thank EATCS and ACM SIGACT for their sponsorship and 
HyperVision for their donation to the conference. We also thank the editors of 
the Lecture Notes in Computer Science series and Springer-Verlag, in particular 
Ms. Anna Kramer, for their help in publishing this volume. 

May 2001 Sheng Yu 

Synthesizing State-Based Object Systems from
LSC Specifications



David Harel and Hillel Kugler 

Department of Computer Science and Applied Mathematics 
The Weizmann Institute of Science, Rehovot, Israel 
{harel,kugler}@wisdom.weizmann.ac.il



Abstract. Live sequence charts (LSCs) have been defined recently as 
an extension of message sequence charts (MSCs; or their UML variant, 
sequence diagrams) for rich inter-object specification. One of the main 
additions is the notion of universal charts and hot, mandatory behavior, 
which, among other things, enables one to specify forbidden scenarios. 
LSCs are thus essentially as expressive as statecharts. This paper deals 
with synthesis, which is the problem of deciding, given an LSC specifi- 
cation, if there exists a satisfying object system and, if so, to synthesize 
one automatically. The synthesis problem is crucial in the development 
of complex systems, since sequence diagrams serve as the manifestation 
of use cases — whether used formally or informally — and if synthesiz- 
able they could lead directly to implementation. Synthesis is considerably 
harder for LSCs than for MSCs, and we tackle it by defining consistency, 
showing that an entire LSC specification is consistent iff it is satisfiable 
by a state-based object system, and then synthesizing a satisfying system
as a collection of finite state machines or statecharts. 



1 Introduction 

1.1 Background and Motivation 

Message sequence charts (MSCs) are a popular means for specifying scenarios 
that capture the communication between processes or objects. They are partic- 
ularly useful in the early stages of system development. MSCs have found their 
way into many methodologies, and are also a part of the UML [UMLdocs], where
they are called sequence diagrams. There is also a standard for the MSC language,
which has appeared as a recommendation of the ITU [Z120] (previously
called the CCITT).

Damm and Harel [DH99] have raised a few problematic issues regarding
MSCs, most notably some severe limitations in their expressive power. The semantics
of the language is a rather weak partial ordering of events. It can be used
to make sure that the sending and receiving of messages, if occurring, happens 
in the right order, but very little can be said about what the system actually 
does, how it behaves when false conditions are encountered, and which scenarios 
are forbidden. This weakness prevents sequence charts from becoming a serious 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 1-E3 2001. 
means for describing system behavior, e.g., as an adequate language for substantiating
the use-cases of [J92,UMLdocs]. Damm and Harel [DH99] then go
on to define live sequence charts (LSCs), as a rather rich extension of MSCs.
The main addition is liveness, or universality, which provides constructs for 
specifying not only possible behavior, but also necessary, or mandatory behav- 
ior, both globally, on the level of an entire chart and locally, when specifying 
events, conditions and progress over time within a chart. Liveness allows for the 
specification of “anti-scenarios” (forbidden ones), and strengthens structuring 
constructs such as subcharts, branching and iteration. LSCs are essentially as
expressive as statecharts. As explained in [DH99], the new language can serve as
the basis of tools supporting specification and analysis of use-cases and scenarios
(both formally and informally), thus providing a far more powerful means
for setting requirements for complex systems.

The availability of a scenario-oriented language with this kind of expressive 
power is also a prerequisite to addressing one of the central problems in behavioral
specification of systems: (in the words of [DH99]) to relate scenario-based
inter-object specification with state machine intra-object specification. One of
the most pressing issues in relating these two dual approaches to specifying
behavior is synthesis, i.e., the problem of automatically constructing a behaviorally
equivalent state-based specification from the scenarios. Specifically, we
want to be able to generate a statechart for each object from an LSC specifi- 
cation of the system, if this is possible in principle. The synthesis problem is 
crucial in the development of complex object-oriented systems, since sequence 
diagrams serve to instantiate use cases. If we can synthesize state-based systems 
from them, we can use tools such as Rhapsody (see [HG97]) to generate running
code directly from them, and we will have taken a most significant step towards
going automatically from instantiated use-cases to implementation, which is an
exciting (and ambitious!) possibility. See the discussion in the recent [H00]. And,
of course, we couldn’t have said this about the (far easier) problem of synthesiz- 
ing from conventional sequence diagrams, or MSCs, since their limited expressive 
power would render the synthesized system too weak to be really useful; in par- 
ticular, there would be no way to guarantee that the synthesized system would 
satisfy safety constraints (i.e., that bad things — such as a missile firing with 
the radar not locked on the target — will not happen). 

In this paper we address the synthesis problem in a slightly restricted LSC 
language, and for an object model in which behavior of objects is described by 
state machines with synchronous communication. For the most part the resulting 
state machines are orthogonality-free and flat, but in the last section of the paper
we sketch a construction that takes advantage of the more advanced constructs 
of statecharts. 

An important point to be made is that the most interesting and difficult as- 
pects in the development of complex systems stem from the interaction between 
different features, which in our case is modeled by the requirements made in 
different charts. Hence, a synthesis approach that deals only with a single chart
(even if it is an LSC) does not solve the crux of the problem.

The paper is organized as follows. Section 2 introduces the railcar system
of [HG97] and shows how it can be specified using LSCs. This example will be
used throughout the paper to explain and illustrate our main ideas. Section 3 
then goes on to explain the LSC semantics and to define when an object system 
satisfies an LSC specification. In Section 4 we define the consistency of an LSC 
specification and prove that consistency is a necessary and sufficient condition for 
satisfiability. We then describe an algorithm for deciding if a given specification 
is consistent. The synthesis problem is addressed in Section 5, where we present 
a synthesis algorithm that assumes fairness. We then go on to show how this 
algorithm can be extended to systems that do not guarantee fairness. (Lacking 
fairness, the system synthesized does not generate the most general language 
as it does in the presence of fairness.) In Section 6 we outline an algorithm for 
synthesizing statecharts, with their concurrent, orthogonal state components. 



1.2 Related Work 



As far as the limited case of classical message sequence charts goes, there has
been quite some work on synthesis from them. This includes the SCED method
[KM94,KSTM98] and synthesis in the framework of ROOM charts [LMR98].
Other relevant work appears in [SD93,AY99,AEY00,BK98,KGSB99,WS00]. The
full paper [HKg99] provides brief descriptions of these efforts.¹ In addition, there
is the work described in [KW00], which deals with LSCs, but synthesizes from
a single chart only: an LSC is translated into a timed Büchi automaton (from
which code can be derived).

In addition to synthesis work directly from sequence diagrams of one kind or 
another, one should realize that constructing a program from a specification is 
a long-known general and fundamental problem. There has been much research 
on constructing a program from a specification given in temporal logic. 

The early work on this kind of synthesis considered closed systems, that do 
not interact with the environment [MW80,EC82]. In this case a program can be
extracted from a constructive proof that the formula is satisfiable. This approach 
is not suited to synthesizing open systems that interact with the environment, 
since satisfiability implies the existence of an environment in which the program 
satisfies the formula, but the synthesized program cannot restrict the environment.
Later work in [PR89a,PR89b,ALW89,WD91] dealt with the synthesis of
open systems from linear temporal logic specifications. The realizability prob- 
lem is reduced to checking the nonemptiness of tree automata, and a finite state 
program can be synthesized from an infinite tree accepted by the automaton. 

In [PR90], synthesis of a distributed reactive system is considered. Given an
architecture — a set of processors and their interconnection scheme — a solution 
to the synthesis problem yields finite state programs, one for each processor, 
whose joint behavior satisfies the specification. It is shown in [PR90] that the
realizability of a given specification over a given architecture is undecidable. 



Previous work assumed the easy architecture of a single processor, and then
realizability was decidable. In our work, an object of the synthesized system
can share all the information it has with all other objects, so the undecidability
results of [PR90] do not apply here.

¹ See technical report MCS99-20, October 1999, The Weizmann Institute of Science,
at http://www.wisdom.weizmann.ac.il/reports.html.

Another important approach discussed in [PR90] is first synthesizing a single
processor program, and then decomposing it to yield a set of programs for
the different processors. The problem of finite-state decomposition is an easier
problem than realizing an implementation. Indeed, it is shown in [PR90] that
decompositionality of a given finite state program into a set of programs over 
a given architecture is decidable. The construction we present in Section 5 can 
be viewed as following parts of this approach by initially synthesizing a global 
system automaton describing the behavior of the entire system and then dis- 
tributing it, yielding a set of state machines, one for each object in the system. 
However, the work on temporal logic synthesis assumes a model in which the 
system and the environment take turns making moves, each side making one 
move in its turn. We consider a more realistic model, in which after each move 
by the environment, the system can make any finite number of moves before the 
environment makes its next move. 



2 An Example 

In this section we introduce the railcar system, which will be used throughout 
the paper as an example to explain and illustrate the main ideas and results. 
A detailed description of the system appears in [HG97], while [DH99] uses it to
illustrate LSC specifications. To make this paper self-contained and to illustrate
the main ideas of LSCs, we now show some of the basic objects and scenarios of 
the example. 

The automated railcar system consists of six terminals, located on a cyclic 
path. Each pair of adjacent terminals is connected by two rail tracks. Several 
railcars are available to transport passengers between terminals. 

Here now is some of the required behavior, using LSCs. Fig. 1 describes a car
departing from a terminal. The objects participating in this scenario are cruiser, 
car, carHandler. The chart describes the message communication between the 
objects, with time propagating from top to bottom. The chart of Fig. 1 is uni- 
versal. Whenever its activation message occurs, i.e., the car receives the message 
setDest from the environment, the sequence of messages in the chart should 
occur in the following order: the car sends a departure request departReq to 
the car handler, which sends a departure acknowledgment departAck back to 
the car. The car then sends a start message to the cruiser in order to activate 
the engine, and the cruiser responds by sending started to the car. Finally, the 
car sends engage to the cruiser and now the car can depart from the terminal. 

[Fig. 1. Perform Departure. Instances: cruiser, car, carHandler; activation message setDest; messages departReq, departAck, start, started, engage.]

A scenario in which a car approaches the terminal is described in Fig. 2.
This chart is also universal, but here instead of having a single message as an
activation, the chart is activated by the prechart shown in the upper part of the
figure (in dashed line-style, and looking like a condition, since it is conditional
in the cold sense of the word, a notion we explain below): in the prechart, the
message departAck is communicated between the car handler and the car, and
the message alert100 is communicated between the proximity sensor and the
car. If these messages indeed occur as specified in the prechart, then the body
of the chart must hold: the car sends the arrival request arrivReq to the car
handler, which sends an arrival acknowledgment arrivAck back to the car.



[Fig. 2. Perform Approach. Instances: proxSensor, car, carHandler; prechart messages departAck, alert100; body messages arrivReq, arrivAck.]



Figs. 3 and 4 are existential charts, depicted by dashed borderlines. These 
charts describe two possible scenarios of a car approaching a terminal: stop at 
terminal and pass through terminal, respectively. Since the charts are existential,
they need not be satisfied in all runs; it is only required that for each of these 
charts the system has at least one run satisfying it. In an iterative development 
of LSC specifications, such existential charts may be considered informal, or 
underspecified, and can later be transformed into universal charts specifying 
the exact activation message or prechart that is to determine when each of the 
possible approaches happens. 



[Fig. 3. Stop at terminal. Instances: proxSensor, cruiser, car, carHandler; messages alertStop, arrivReq, arrivAck, disengage, stop.]

[Fig. 4. Pass through terminal. Instances: car, carHandler; messages arrivReq, arrivAck, departReq, departAck.]



The simple universal chart in Fig. 5 requires that when the proximity sensor 
receives the message comingClose from the environment, signifying that the 
car is getting close to the terminal, it sends the message alert100 to the car.
This prevents a system from satisfying the chart in Fig. 2 by never sending the
message alert100 from the proximity sensor to the car, so that the prechart is
never satisfied and there is no requirement that the body of the chart hold. 



[Fig. 5. Coming close to terminal. Instances: proxSensor, car; activation message comingClose; message alert100.]



The set of charts in Figs. 1-5 can be considered as an LSC specification for 
(part of) the railcar system. Our goal in this paper is to develop algorithms to 
decide, for any such specification, if there is a satisfying object system and, if so, 
to synthesize one automatically. As mentioned in the introduction, what makes 
our goal both harder and more interesting is in the treatment of a set of charts, 
not just a single one. 

3 LSC Semantics 

The semantics of the LSC language is defined in [DH99], and we now explain
some of the basic definitions and concepts of this semantics using the railcar 
example. 

Consider the Perform Departure chart of Fig. 1. In Fig. 6 it appears with 
a labeling of the locations of the chart. The set of locations for this chart is 
thus: 

{⟨cruiser, 0⟩, ⟨cruiser, 1⟩, ⟨cruiser, 2⟩, ⟨cruiser, 3⟩, ⟨car, 0⟩, ⟨car, 1⟩, ⟨car, 2⟩,
⟨car, 3⟩, ⟨car, 4⟩, ⟨car, 5⟩, ⟨carHandler, 0⟩, ⟨carHandler, 1⟩, ⟨carHandler, 2⟩}

The chart defines a partial order <_m on locations. The requirement for order
along an instance line implies, for example, ⟨car, 0⟩ <_m ⟨car, 1⟩. The order induced
from message sending implies, for example, ⟨car, 1⟩ <_m ⟨carHandler, 1⟩.
From transitivity we get that ⟨car, 0⟩ <_m ⟨carHandler, 1⟩.

[Fig. 6. The chart of Fig. 1 (activation message setDest; instances cruiser, car, carHandler) with its locations labeled.]



One of the basic concepts used for defining the semantics of LSCs, and later 
on in our synthesis algorithms, is the notion of a cut. A cut through a chart 
represents the progress each instance has made in the scenario. Not every “slice”, 
i.e., a set consisting of one location for each instance, is a cut. For example, 

(⟨cruiser, 1⟩, ⟨car, 2⟩, ⟨carHandler, 2⟩)

is not a cut. Intuitively the reason for this is that to receive the message start
by the cruiser (in location ⟨cruiser, 1⟩), the message must have been sent, so
location ⟨car, 3⟩ must have already been reached.

The cuts for the chart of Fig. 6 are thus:

{(⟨cruiser, 0⟩, ⟨car, 0⟩, ⟨carHandler, 0⟩), (⟨cruiser, 0⟩, ⟨car, 1⟩, ⟨carHandler, 0⟩),
(⟨cruiser, 0⟩, ⟨car, 1⟩, ⟨carHandler, 1⟩), (⟨cruiser, 0⟩, ⟨car, 1⟩, ⟨carHandler, 2⟩),
(⟨cruiser, 0⟩, ⟨car, 2⟩, ⟨carHandler, 2⟩), (⟨cruiser, 0⟩, ⟨car, 3⟩, ⟨carHandler, 2⟩),
(⟨cruiser, 1⟩, ⟨car, 3⟩, ⟨carHandler, 2⟩), (⟨cruiser, 2⟩, ⟨car, 3⟩, ⟨carHandler, 2⟩),
(⟨cruiser, 2⟩, ⟨car, 4⟩, ⟨carHandler, 2⟩), (⟨cruiser, 2⟩, ⟨car, 5⟩, ⟨carHandler, 2⟩),
(⟨cruiser, 3⟩, ⟨car, 5⟩, ⟨carHandler, 2⟩)}

The sequence of cuts in this order constitutes a run. The trace of this run
is:

⟨env, car.setDest⟩, ⟨car, carHandler.departReq⟩, ⟨carHandler, car.departAck⟩,
⟨car, cruiser.start⟩, ⟨cruiser, car.started⟩, ⟨car, cruiser.engage⟩
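
The notion of a cut can be made concrete with a short sketch. The Python below is an illustration only and not the formal definition of [DH99]: it assumes a hypothetical encoding of the Perform Departure chart in which each message is a tuple (sender, send location, receiver, receive location, name), with location numbers inferred from the examples in the text, and it treats a slice as a cut when every message that has been received has also been sent.

```python
from itertools import product

# Hypothetical encoding of the Perform Departure chart (Fig. 6).
INSTANCES = ("cruiser", "car", "carHandler")
LAST_LOCATION = {"cruiser": 3, "car": 5, "carHandler": 2}

# (sender, send location, receiver, receive location, message name).
# The location numbers are an assumption inferred from the text.
MESSAGES = [
    ("car", 1, "carHandler", 1, "departReq"),
    ("carHandler", 2, "car", 2, "departAck"),
    ("car", 3, "cruiser", 1, "start"),
    ("cruiser", 2, "car", 4, "started"),
    ("car", 5, "cruiser", 3, "engage"),
]


def is_cut(slice_):
    """A slice is a cut if each received message has already been sent."""
    loc = dict(zip(INSTANCES, slice_))
    return all(loc[snd] >= s for snd, s, rcv, r, _ in MESSAGES if loc[rcv] >= r)


def all_cuts():
    """Enumerate every slice and keep the cuts, ordered by total progress."""
    ranges = [range(LAST_LOCATION[i] + 1) for i in INSTANCES]
    return sorted((c for c in product(*ranges) if is_cut(c)), key=sum)


def trace_of_run(cuts):
    """List messages in the order their send locations are passed."""
    trace = []
    for before, after in zip(cuts, cuts[1:]):
        old, new = dict(zip(INSTANCES, before)), dict(zip(INSTANCES, after))
        for snd, s, _, _, name in MESSAGES:
            if old[snd] < s <= new[snd]:
                trace.append(name)
    return trace


CUTS = all_cuts()
TRACE = trace_of_run(CUTS)
```

Sorting by total progress yields a run here only because the events of this particular chart form a single chain; a chart with independent messages would need an explicit search over successor cuts. The activation message setDest is not modeled.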

This chart has only one run, but in general a chart can have many runs.
Consider the chart in Fig. 7. From the initial cut (0, 0, 0, 0)² it is possible to
progress either by the car sending departReq to the car handler, or by the
passenger sending pressButton to the destPanel. Similarly there are possible
choices from other cuts. Fig. 8 gives an automaton representation for all the
possible runs. This will be the basic idea for the construction of the synthesized
state machines in our synthesis algorithms later on. Each state, except for the
special starting state s_0, represents a cut and is labeled by the vector of locations.
Successor cuts are connected by edges labeled with the message sent. Assuming
a synchronous model we do not have separate edges for the sending and receiving
of the same message. A path starting from s_0 that returns to s_0 represents a
run.

² We often omit the names of the objects, for simplicity, when listing cuts.



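[Fig. 7. Chart with activation message setDest; instances car, car handler, destPanel, passenger, each with locations 0 to 2; messages departReq, departAck, pressButton, flashSign.]

[Fig. 8. Automaton representation of the possible runs of the chart of Fig. 7.]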

Here are two sample traces from these runs:

⟨env, car.setDest⟩, ⟨car, carHandler.departReq⟩, ⟨carHandler, car.departAck⟩,
⟨passenger, destPanel.pressButton⟩, ⟨destPanel, passenger.flashSign⟩

⟨env, car.setDest⟩, ⟨car, carHandler.departReq⟩, ⟨passenger, destPanel.pressButton⟩,
⟨carHandler, car.departAck⟩, ⟨destPanel, passenger.flashSign⟩
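
The automaton of cuts for the four-instance chart can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the construction used in the paper: the dictionary `MESSAGES` is a hypothetical encoding in which each message moves its sender and receiver together to the given location (the synchronous model), and the special starting state s_0 and the activation message setDest are left out.

```python
INSTANCES = ("car", "carHandler", "destPanel", "passenger")

# message name -> (sender, receiver, location both instances reach).
# The numbering is an assumption based on the locations 0..2 of the chart.
MESSAGES = {
    "departReq": ("car", "carHandler", 1),
    "departAck": ("carHandler", "car", 2),
    "pressButton": ("passenger", "destPanel", 1),
    "flashSign": ("destPanel", "passenger", 2),
}
INITIAL = (0, 0, 0, 0)
FINAL = (2, 2, 2, 2)


def successors(cut):
    """Yield (message, successor cut) for every message enabled in cut."""
    loc = dict(zip(INSTANCES, cut))
    for name, (snd, rcv, k) in MESSAGES.items():
        if loc[snd] == k - 1 and loc[rcv] == k - 1:
            nxt = dict(loc)
            nxt[snd] = nxt[rcv] = k
            yield name, tuple(nxt[i] for i in INSTANCES)


def build_automaton():
    """Explore all cuts reachable from the initial cut."""
    edges, todo = {}, [INITIAL]
    while todo:
        cut = todo.pop()
        if cut not in edges:
            edges[cut] = list(successors(cut))
            todo.extend(nxt for _, nxt in edges[cut])
    return edges


def runs(edges, cut=INITIAL, prefix=()):
    """Yield the message sequence of every path from INITIAL to FINAL."""
    if cut == FINAL:
        yield prefix
    for name, nxt in edges[cut]:
        yield from runs(edges, nxt, prefix + (name,))


EDGES = build_automaton()
RUNS = set(runs(EDGES))
```

Under this encoding the two message pairs are independent of each other, so every interleaving that keeps departReq before departAck and pressButton before flashSign is a run; the two sample traces are among them.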

As part of the “liveness” extensions, the LSC language enables 
forcing progress along an instance line. Each location is given a temperature 
hot or cold, graphically denoted by solid or dashed segments of the instance 
line. A run must continue down solid lines, while it may continue down dashed 
lines. Formally, we require that in the final cut in a run all locations are cold. 
Consider the perform approach scenario appearing in Fig. 9. The dashed seg- 
ments in the lower part of the car and carHandler instances specify that it is 
possible that the message arrivAck will not be sent, even in a run in which the 
prechart holds. This might happen in a situation where the terminal is closed or 
when all the platforms are full. 



[Fig. 9. Perform approach chart with instances proxSensor, car, carHandler; messages alert100, departAck, arrivReq, arrivAck.]



When defining the languages of a chart in [DH99], messages that do not
appear in the chart are not restricted and are allowed to occur in-between the 
messages that do appear, without violating the chart. This is an abstraction 
mechanism that enables concentrating on the relevant messages in a scenario. In 
practice it may be useful to restrict messages that do not appear explicitly in the 
chart. Each chart will then have a designated set of messages that are not allowed 
to occur anywhere except if specified explicitly in the chart; and this applies even 
if they do not appear anywhere in the chart. A tool may support convenient
selection of this message set. Consider the perform departure scenario in
Fig. 1. By taking its set of messages to include those appearing therein, but
also alert100, arrivReq and arrivAck, we restrict these three messages from
occurring during the departure scenario, which makes sense since we cannot
arrive at a terminal when we are just in the middle of departing from one.

As in [DH99], an LSC specification is defined as:

$LS = \langle M, amsg, mod \rangle$,

where M is a set of charts, and amsg and mod are the activation messages³ and
the modes of the charts (existential or universal), respectively.

A system satisfies an LSC specification if, for every universal chart and every 
run, whenever the activation message holds the run must satisfy the chart, and 
if, for every existential chart, there is at least one run in which the activation 
message holds and then the chart is satisfied. Formally, 

Definition 1. A system S satisfies the LSC specification
$LS = \langle M, amsg, mod \rangle$,
written $S \models LS$, if:

1. $\forall m \in M,\ mod(m) = universal \Rightarrow \forall \eta,\ \mathcal{L}_S^{\eta} \subseteq \mathcal{L}_m$

2. $\forall m \in M,\ mod(m) = existential \Rightarrow \exists \eta,\ \mathcal{L}_S^{\eta} \cap \mathcal{L}_m \neq \emptyset$

Here $\mathcal{L}_S^{\eta}$ is the trace set of object system S on the sequence of directed
requests $\eta$. We omit a detailed definition here, which can be found, e.g., in
[HKg99]. $\mathcal{L}_m$ is the language of the chart m, containing all traces satisfying the
chart. We say that an LSC specification is satisfiable if there is a system that
satisfies it.
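
The two clauses of the definition can be illustrated on finite data. The sketch below is a toy under explicit assumptions: trace sets are finite Python sets keyed by a request sequence, each chart language is given as a membership predicate, and the two example predicates are hypothetical stand-ins rather than the chart languages of [DH99].

```python
def satisfies(trace_sets, charts):
    """Check both clauses of the definition on finite trace sets.

    trace_sets: dict mapping a request sequence (any hashable key) to the
                set of traces of the system on that sequence.
    charts:     list of (mode, in_language) pairs, mode being "universal"
                or "existential" and in_language a predicate on traces.
    """
    all_traces = [t for traces in trace_sets.values() for t in traces]
    for mode, in_language in charts:
        if mode == "universal":
            # Clause 1: every trace set is contained in the chart language.
            if not all(in_language(t) for t in all_traces):
                return False
        else:
            # Clause 2: some trace set meets the chart language.
            if not any(in_language(t) for t in all_traces):
                return False
    return True


def close_then_alert(trace):
    """Toy universal language: each comingClose is later followed by alert100."""
    pending = False
    for msg in trace:
        if msg == "comingClose":
            pending = True
        elif msg == "alert100":
            pending = False
    return not pending


def contains_stop(trace):
    """Toy existential language: the message stop occurs somewhere."""
    return "stop" in trace


CHARTS = [("universal", close_then_alert), ("existential", contains_stop)]
GOOD = {"eta1": {("comingClose", "alert100")},
        "eta2": {("comingClose", "alert100", "stop")}}
SILENT = {"eta1": {("comingClose",)}, "eta2": {("stop",)}}
```

With these toy inputs, `satisfies(GOOD, CHARTS)` holds, while `satisfies(SILENT, CHARTS)` fails on the universal clause.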

4 Consistency of LSCs 

Our goal is to automatically construct an object system that is correct with 
respect to a given LSC specification. When working with an expressive language 
like LSCs that enables specifying both necessary and forbidden behavior, and 
in which a specification is a well-defined set of charts of different kinds, there 
might very well be self-contradictions, so that there might be no object system
that satisfies it. 

Consider an LSC specification that contains the universal charts of Figs. 10 
and 11. The message setDest sent from the environment to the car activates Fig. 
10, which requires that following the departReq message, departAck is sent 
from the car handler to the car. This message activates Fig. 11, which requires 
the sending of engage from the car to the cruiser before the start and started 
messages are sent, while Fig. 10 requires the opposite ordering. A contradiction. 
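
This kind of contradiction can be detected mechanically in simple cases. The sketch below is a deliberately simplified illustration, not the consistency algorithm of this paper: it assumes each chart can be flattened to a total order on message names, with each message occurring once, and it asks whether the union of the ordering constraints contains a cycle. The lists `CHART_10` and `CHART_11` are hypothetical encodings of the two charts.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical flattening of each chart to the order of its messages.
CHART_10 = ["departReq", "departAck", "start", "started", "engage"]
CHART_11 = ["engage", "start", "started"]


def has_common_order(*charts):
    """Return True if one message ordering respects every chart."""
    predecessors = {}
    for chart in charts:
        for earlier, later in zip(chart, chart[1:]):
            predecessors.setdefault(earlier, set())
            predecessors.setdefault(later, set()).add(earlier)
    try:
        # static_order raises CycleError when the constraints are cyclic.
        list(TopologicalSorter(predecessors).static_order())
        return True
    except CycleError:
        return False
```

Each chart on its own admits an ordering, but together they require both start before engage and engage before start, so `has_common_order(CHART_10, CHART_11)` is false.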

[Fig. 10. Universal chart with activation message setDest; instances cruiser, car, carHandler; messages departReq, departAck, start, started, engage.]

[Fig. 11. Universal chart with activation message departAck; instances cruiser, car, carHandler; messages engage, start, started.]



This is only a simple example of an inconsistency in an LSC specification. 
Inconsistencies can be caused by such an “interaction” between more than two 
universal charts, and also when a scenario described in an existential chart can 
never occur because of the restrictions from the universal charts. In a compli- 
cated system consisting of many charts the task of finding such inconsistencies 
manually by the developers can be formidable, and algorithmic support for this 
process can help in overcoming major problems in early stages of the analysis. 



³ In the general case we allow a prechart instead of only a single activation message.
However, in this paper we provide the proofs of our results for activation messages,
but they can be generalized rather easily to precharts too.

4.1 Consistency = Satisfiability

We now provide a global notion of the consistency of an LSC specification. This 
is easy to do for conventional, existential MSCs, but is harder for LSCs. In 
particular, we have to make sure that a universal chart is satisfied by all runs, 
from all points in time. 

We will use the following notation: A_in is the alphabet denoting messages
sent from the environment to objects in the system, while A_out denotes messages
sent between the objects in the system.

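Definition 2. An LSC specification $LS = \langle M, amsg, mod \rangle$ is consistent if
there exists a nonempty regular language $\mathcal{L}_1 \subseteq (A_{in} \cdot A_{out}^*)^*$ satisfying the
following properties:

1. $\mathcal{L}_1 \subseteq \bigcap_{m_j \in M,\ mod(m_j) = universal} \mathcal{L}_{m_j}$

2. $\forall w \in \mathcal{L}_1\ \forall a \in A_{in}\ \exists r \in A_{out}^*$ s.t. $w \cdot a \cdot r \in \mathcal{L}_1$.

3. $\forall w \in \mathcal{L}_1,\ w = x \cdot y \cdot z,\ y \in A_{in} \Rightarrow x \in \mathcal{L}_1$.

4. $\forall m \in M,\ mod(m) = existential \Rightarrow \mathcal{L}_m \cap \mathcal{L}_1 \neq \emptyset$.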

The language $\mathcal{L}_1$ is what we require as the set of satisfying traces. Clause 1
in the definition requires all universal charts to be satisfied by all the traces in
$\mathcal{L}_1$. Clause 2 requires a trace to be extendible if a new message is sent in from
the environment. Clause 3 essentially requires traces to be completed before new
messages from the environment are dealt with, and Clause 4 requires existential
charts to be satisfied by traces from within $\mathcal{L}_1$.

Now comes the first central result of the paper, showing that the consistency 
of an LSC specification is a necessary and sufficient condition for the existence 
of an object system satisfying it. 

Theorem 1. A specification LS is satisfiable if and only if it is consistent. 
Proof. Appears in the Appendix. 

A basic concept used in the proof of Theorem 1 is the notion of a global
system automaton, or a GSA. A GSA describes the behavior of the entire
system, that is, the message communication between the objects in the system in
response to messages received from the environment. A rigorous definition of
the GSA appears in the Appendix. Basically, it is a finite state automaton with
input alphabet consisting of messages sent from the environment to the system
(A_in), and output alphabet consisting of messages communicated between the
objects in the system (A_out). The GSA may have null transitions, transitions
that can be taken spontaneously without the triggering of a message. We add a
fairness requirement: a null transition that is enabled an infinite number of
times must be taken an infinite number of times. A fair cycle is a loop of states
connected by null transitions, which can be taken repeatedly without violating
the fairness requirement. We require that the system has no fair cycles, thus
ensuring that the system's reactions are finite.

In the proof of Theorem 1 (the Consistency ⇒ Satisfiability direction in the
Appendix) we show that it is possible to construct a GSA satisfying the specification.
This implies the existence of an object system (a separate automaton for
each object) satisfying the specification. Later on, when discussing synthesis, we
will show methods for the distribution of the GSA between the objects to obtain
a satisfying object system. In Section 5.5 we show that the fairness requirement
is not essential for our construction: it is possible to synthesize a satisfying
object system that does not use null transitions and the fairness requirement,
although it does not generate the most general language.

4.2 Deciding Consistency 

It follows from Theorem 1 that to prove the existence of an object system satisfying
an LSC specification LS, it suffices to prove that LS is consistent. In this
section we present an algorithm for deciding consistency.

A basic construction used in the algorithm is that of a deterministic finite
automaton accepting the language of a universal chart. Such an automaton for
the chart of Fig. 7 is shown in Fig. 12. The initial state s_0 is the only accepting
state. The activation message setDest causes a transition from state s_0, and
the automaton will return to s_0 only if the messages departReq, departAck,
pressButton and flashSign occur as specified in the chart. Notice that the
different orderings of these messages that are allowed by the chart are represented
in the automaton by different paths. Each such message causes a transition to a
state representing a successor cut. The self transitions of the nonaccepting states
allow only messages that are not restricted by the chart. The initial state s_0 has
self transitions for message comingClose sent from the environment and for all
other messages between objects in the system. To avoid cluttering the figure we
have not written the messages on the self transitions.

The construction algorithm of this automaton and its proof of correctness 
are omitted from this version of the paper. 

An automaton accepting exactly the runs that satisfy all the universal charts 
can be constructed by intersecting these separate automata. This intersection 
automaton will be used in the algorithm for deciding consistency. The idea is 
to start with this automaton, which represents the “largest” regular language 
satisfying all the universal charts, and to systematically narrow it down in order 
to avoid states from which the system will be forced by the environment to violate 
the specification. At the end we must check that there are still representative 
runs satisfying each of the existential charts. 
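
The intersection step can be sketched with the standard product construction. The following Python fragment is only an illustration under simplifying assumptions: the two toy automata, the alphabet and the helper names `intersect_dfas` and `accepts` are hypothetical and are not the chart automata of the paper; every DFA is assumed to be total over a shared alphabet.

```python
from collections import deque


def intersect_dfas(dfas, alphabet):
    """Product construction over total DFAs.

    Each DFA is a triple (initial, delta, accepting), where delta maps
    (state, symbol) -> state for every symbol in `alphabet`.
    Returns (initial, delta, accepting, states) of the reachable product.
    """
    start = tuple(d[0] for d in dfas)
    delta, accepting = {}, set()
    states, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        # A product state accepts only if every component state accepts.
        if all(s in d[2] for s, d in zip(state, dfas)):
            accepting.add(state)
        for symbol in alphabet:
            nxt = tuple(d[1][(s, symbol)] for s, d in zip(state, dfas))
            delta[(state, symbol)] = nxt
            if nxt not in states:
                states.add(nxt)
                queue.append(nxt)
    return start, delta, accepting, states


def accepts(start, delta, accepting, word):
    """Run the (product) DFA on a finite word."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in accepting


# Hypothetical toy "charts": each automaton tracks the parity of one message.
even_a = (0, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, {0})
even_b = (0, {(0, "b"): 1, (1, "b"): 0, (0, "a"): 0, (1, "a"): 1}, {0})
start, delta, accepting, states = intersect_dfas([even_a, even_b], ["a", "b"])
```

Only reachable product states are generated, which in practice can be far fewer than the full Cartesian product of the component state sets.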

Here is our algorithm for checking consistency:

Algorithm 2

1. Find the minimal DFA $\mathcal{A} = (A, S, s_0, \rho, F)$ that accepts the
language

$$\mathcal{L} = \bigcap_{m_j \in M,\ mod(m_j) = universal} \mathcal{L}_{m_j}.$$

(The existence of such an automaton follows from the discussion above and
is proved in the full version of the paper [HK99].)

Fig. 12.


2. Define the sets $Bad_i \subseteq S$, for i = 0, 1, ..., as follows:

$$Bad_0 = \{ s \in S \mid \exists a \in A_{in} \text{ s.t. } \forall x \in A_{out}^{*},\ \rho(s, a \cdot x) \notin F \},$$

$$Bad_i = \{ s \in S \mid \exists a \in A_{in} \text{ s.t. } \forall x \in A_{out}^{*},\ \rho(s, a \cdot x) \notin F - Bad_{i-1} \}.$$

The series $Bad_i$ is monotonically increasing, with $Bad_i \subseteq Bad_{i+1}$, and since
S is finite it converges. Let us denote the limit set by $Bad_{max}$.

3. From $\mathcal{A}$ define a new automaton $\mathcal{A}' = (A, S, s_0, \rho, F')$, where the set of
accepting states has been reduced to $F' = F - Bad_{max}$.

4. Further reduce $\mathcal{A}'$ by removing all transitions that lead from states in $S - F'$
and are labeled with elements of $A_{in}$. This yields the new automaton $\mathcal{A}''$.

5. Check whether $\mathcal{L}(\mathcal{A}'') \neq \emptyset$ and whether, in addition, $\mathcal{L}_{m_i} \cap \mathcal{L}(\mathcal{A}'') \neq \emptyset$ for
each $m_i \in M$ with $mod(m_i) = existential$. If both are true output YES;
otherwise output NO.
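
The fixpoint of step 2 can be sketched as follows. This is a minimal Python illustration, assuming a total transition function stored as a dictionary; the function names `out_reach` and `bad_max` and the three-state example are hypothetical and not taken from the paper. The sketch uses the observation that "for all x in A_out*, rho(s, a.x) is not in T" holds exactly when T is unreachable from rho(s, a) through A_out symbols alone.

```python
def out_reach(states, delta, a_out, targets):
    """States from which some target is reachable using zero or more A_out symbols."""
    reach = set(targets)
    changed = True
    while changed:
        changed = False
        for s in states:
            if s not in reach and any(delta[(s, x)] in reach for x in a_out):
                reach.add(s)
                changed = True
    return reach


def bad_max(states, delta, a_in, a_out, accepting):
    """Iterate Bad_0, Bad_1, ... until the series stabilises."""
    bad = set()
    while True:
        # Targets are F for Bad_0 and F - Bad_{i-1} afterwards.
        reach = out_reach(states, delta, a_out, accepting - bad)
        # s is bad if some environment message leads to a state that can no
        # longer reach a good accepting state via system messages only.
        new_bad = {s for s in states
                   if any(delta[(s, a)] not in reach for a in a_in)}
        if new_bad == bad:
            return bad
        bad = new_bad


# Hypothetical example: state 0 is accepting, state 2 is a violation sink.
states = {0, 1, 2}
a_in, a_out = {"req"}, {"ack"}
delta = {(0, "req"): 1, (1, "ack"): 0, (0, "ack"): 2,
         (1, "req"): 2, (2, "req"): 2, (2, "ack"): 2}
accepting = {0}
bad = bad_max(states, delta, a_in, a_out, accepting)
reduced_accepting = accepting - bad   # F' of step 3
```

In this toy example the accepting state survives, so the reduced automaton still has a nonempty language; if every accepting state ended up in the limit set, step 5 would answer NO.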



Proposition 1. The algorithm is correct: given a specification LS, it terminates
and outputs YES iff LS is consistent.

Proof. Omitted in this version of the paper. See [HK99].

In case the algorithm answers YES, the specification is consistent and it is 
possible to proceed to automatically synthesize the system, as we show in the 
next section. However, for the cases where the algorithm answers NO, it would
be very helpful to provide the developer with information about the source of 
the inconsistency. Step 5 of our algorithm provides the basis for achieving this 
goal. Here is how. 

The answer is NO if $\mathcal{L}(\mathcal{A}'') = \emptyset$ or if there is an existential chart $m_i$ such that
$\mathcal{L}_{m_i} \cap \mathcal{L}(\mathcal{A}'') = \emptyset$. In the second case, this existential chart is the information
we need. The first case is more delicate: there is a sequence of messages sent 
from the environment to the system (possibly depending on the reactions of the 
system) that eventually causes the system to violate the specification. Unlike 
verification against a specification, where we are given a specific program or 
system and can display a specific run of it as a counter-example, here we want 
to synthesize the object system so we do not yet have any concrete runs. A 
possible solution is to let the supporting tool play the environment and the user 
play the system, with the aim of locating the inconsistency. The tool can display 
the charts graphically and highlight the messages sent and the progress made 
in the different charts. After each message sent by the environment (determined 
by the tool using the information obtained in the consistency algorithm), the 
user decides which messages are sent between the objects in the system. The 
tool can suggest a possible reaction of the system, and allow the user to modify 
it or choose a different one. Eventually, a universal chart will be violated, and 
the chart and the exact location of this violation can be displayed. 



5 Synthesis of FSMs from LSCs 

We now show how to automatically synthesize a satisfying object system from 
a given consistent specification. We first use the algorithm for deciding
consistency (Algorithm 2), relying on the equivalence of consistency and
satisfiability proved in the Appendix, to derive a global system automaton, a
GSA, satisfying the specification. Synthesis then proceeds by distributing this automaton between the
objects, creating a desired object system. 

The synthesis is demonstrated on our example, taking the charts in Figs. 1-5 
to be the required LSC specification. For the universal charts, Figs. 1, 2 and
5, we assume that the sets of restricted messages (those not appearing in the
charts) are { alertStop, alert100, arrivReq, arrivAck, disengage, stop },
{ departReq, start, started, engage } and { departAck }, respectively.

Figs. 13, 14 and 15 show the automata for the perform departure, perform
approach and coming close charts, respectively. Notice that in Fig. 14
there are two accepting states $s_0$ and $s_1$, since we have a prechart with messages
departAck and alert100 that causes activation of the body of the chart. To
avoid cluttering the figures we have not written the messages on the self tran- 
sitions. For nonaccepting states these messages are the non-restricted messages 
between objects in the system, while for accepting states we take all messages 
that do not cause a transition from the state, including messages sent by the 
environment. 

Fig. 13. Automaton for Perform Departure

Fig. 14. Automaton for Perform Approach

Fig. 15. Automaton for Coming Close



The intersection of the three automata of Figs. 13, 14 and 15 is shown in Fig. 
16. It accepts all the runs that satisfy all three universal charts of our system. 




Fig. 16. The Intersection Automaton 



The global system automaton (GSA) derived from this intersection automaton
is shown in Fig. 17. The two accepting states have as outgoing transitions
only messages from the environment. This has been achieved using the
techniques described in the Appendix (in the proof that consistency implies
satisfiability). Notice also the
existence of runs satisfying each of the existential charts. We have used the path
extraction methods of Section 5.5 to retain these runs.




Fig. 17. The Global System Automaton 



After constructing the GSA, the synthesis proceeds by distributing the au- 
tomaton between the objects, creating a desired object system. To illustrate the 
distribution we focus on a subautomaton of the GSA consisting of the states 
$q_0$, $q_1$, $q_2$, $q_3$, $q_4$ and $q_5$, as appearing in Fig. 17. This subautomaton is shown in
Fig. 18. In this figure we provide full information about the message, the sender 
and receiver, since this information is important for the distribution process. 

In general, let $A = (Q, q_0, \delta)$ be a GSA describing a system with objects
$O = \{O_1, \ldots, O_n\}$ and messages $\Sigma = \Sigma_{in} \cup \Sigma_{out}$. Assume that A satisfies the
LSC specification LS. Our constructions employ new messages taken from a set
$\Sigma_{col}$, where $\Sigma_{col} \cap \Sigma = \emptyset$. They will be used by the objects for collaboration in
order to satisfy the specification, and are not restricted by the charts in LS.

There are different ways to distribute the global system automaton between 
the objects. In the next three subsections we discuss three main approaches — 
controller object, full duplication, and partial duplication — and illus- 
trate them on the GSA subautomaton of Fig. 18. The first approach is trivial and 
is shown essentially just to complete the proof of the existence of an object sys- 
tem satisfying a consistent specification. The second method is an intermediate 
stage towards the third approach, which is more realistic. 

Fig. 18. Subautomaton of the GSA



5.1 Controller Object 

In this approach we add to the set of objects in the system O an additional 
object Ocon which acts as the controller of the system, sending commands to all 
the other objects. These will have simple automata to enable them to carry out 
the commands. 

Let $|\Sigma_{col}| = |A_{in}| + |A_{out}|$ and let f be a one-to-one function

$$f : A_{in} \cup A_{out} \to \Sigma_{col}$$

We define the state machine of the controller object $O_{con}$ to be $(Q, q_0, \delta_{con})$,
and the state machine of each object $O_i \in O$ to be $(\{q_{O_i}\}, q_{O_i}, \delta_{O_i})$.

The states and the initial state of $O_{con}$ are identical to those of the GSA. The
transition relation $\delta_{con}$ and the transition relations $\delta_{O_i}$ are defined as follows:

If $(q, a, q') \in \delta$ where $a \in A_{in}$, $a = (env, O_i.\sigma_i)$, then

$(q, f(a), q') \in \delta_{con}$ and $(q_{O_i}, \sigma_i / O_{con}.f(a), q_{O_i}) \in \delta_{O_i}$.

If $(q, /a, q') \in \delta$ where $a \in A_{out}$, $a = (O_i, O_j.\sigma_j)$, then

$(q, /O_i.f(a), q') \in \delta_{con}$ and $(q_{O_i}, f(a) / O_j.\sigma_j, q_{O_i}) \in \delta_{O_i}$.

This construction is illustrated in Fig. 19, which shows the object system ob- 
tained by the synthesis from the GSA of Fig. 18. It includes the state machine of 
the controller object Ocon, and the transitions of the single-state state machines 
of the objects carHandler, car and proxSensor. 

The size of the state machine of the controller object Ocon is equal to that 
of the GSA, while all other objects have state machines with one state. Section 
5.4 discusses the total complexity of the construction. 
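
The controller construction can be sketched in a few lines. The Python fragment below is an illustration only: the transition encoding, the function name `controller_synthesis` and the three-transition GSA fragment are hypothetical (the fragment is not the subautomaton of Fig. 18), and the collaboration messages are given readable names in the style of the labels of Fig. 19 rather than being drawn from an abstract set.

```python
def controller_synthesis(gsa_transitions):
    """Split a GSA into a controller machine and single-state object machines.

    Each GSA transition is (q, sender, receiver, msg, q2); sender == "env"
    marks an environment message, any other sender marks a system message.
    """
    controller = []   # entries (state, trigger, action, next_state)
    objects = {}      # object name -> list of (trigger, action) self-loops
    for q, sender, receiver, msg, q2 in gsa_transitions:
        if sender == "env":
            # The receiving object reports the environment message to the controller.
            col = f"{receiver}_got_{msg}"
            controller.append((q, col, None, q2))
            objects.setdefault(receiver, []).append((msg, f"control->{col}"))
        else:
            # The controller commands the sender, which then sends the real message.
            col = f"send_{msg}_to_{receiver}"
            controller.append((q, None, f"{sender}->{col}", q2))
            objects.setdefault(sender, []).append((col, f"{receiver}->{msg}"))
    return controller, objects


# Hypothetical GSA fragment, loosely modelled on the railcar example.
gsa = [
    ("q0", "env", "car", "setDest", "q1"),
    ("q1", "car", "carHandler", "departReq", "q2"),
    ("q2", "carHandler", "car", "departAck", "q0"),
]
controller, objects = controller_synthesis(gsa)
```

The controller keeps one transition per GSA transition, while every other object only accumulates self-loops on its single state, which matches the size remark above.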

carHandler: send_arrivAck_to_car/car->arrivAck

send_departAck_to_car/car->departAck 

car: send_departReq_to_carHandler/carHandler->departReq 

send_disengage_to_cruiser/cruiser->disengage 
setDest/control->car_got_setDest 

proxSensor: send_alertStop_to_car/car->alertStop 

comingClose/control->proxSensor_got_comingClose 



Fig. 19. Controller 



5.2 Full Duplication 

In this construction there is no controller object. Instead, each object will have 
the state structure of the GSA, and will thus “know” what state the GSA would 
have been in. 

Recalling that $A = (Q, q_0, \delta)$ is the GSA, let k be the maximum outdegree of
the states in Q. A labeling of the transitions of A is a one-to-one function tn:

$$tn : \delta \to \{1, \ldots, k\}$$

Let $|\Sigma_{col}| = k$ and let f be a one-to-one function

$$f : \{1, \ldots, k\} \to \Sigma_{col}$$

The state machine for object $O_i$ in O is defined to be $(Q, q_0, \delta_{O_i})$.

If $(q, a, q') \in \delta$, where $a \in A_{in}$, $a = (env, O_i.\sigma_i)$ and $a' = f(tn(q, a, q')) \in \Sigma_{col}$,
then $(q, \sigma_i / O_{i+1}.a', q') \in \delta_{O_i}$ and for every $j \neq i$, $(q, a' / O_{j+1}.a', q') \in \delta_{O_j}$.

If $(q, /a, q') \in \delta$, where $a \in A_{out}$, $a = (O_i, O_j.\sigma_j)$ and $a' = f(tn(q, /a, q')) \in \Sigma_{col}$,
then $(q, /O_j.\sigma_j; O_{i+1}.a', q') \in \delta_{O_i}$ and for every $j \neq i$, $(q, a' / O_{j+1}.a', q') \in \delta_{O_j}$.

This construction is illustrated in Fig. 20 on the sub-GSA of Fig. 18. The 
maximal outdegree of the states of the GSA in this example is 2, and the set of 
collaboration messages is $\Sigma_{col} = \{1, 2\}$. Again, complexity is discussed in Section 5.4.




Fig. 20. Full Duplication



5.3 Partial Duplication 

The idea here is to distribute the GSA as in the full duplication construction, 
but to merge states that carry information that is not relevant to the object in
question. In some cases this can reduce the total size, although the worst case 
complexity remains the same. 

The state machine of object $O_i$ is defined to be $(Q_{O_i} \cup \{q_{idle}\}, q_0, \delta_{O_i})$, where
$Q_{O_i} \subseteq Q$ is defined by

$$Q_{O_i} = \{ q \in Q \mid \exists q' \in Q\ \exists a \in A_{out} \text{ s.t. } a = (O_i, O_j.\sigma_j),\ (q, /a, q') \in \delta, \text{ or } \exists q' \in Q\ \exists a \in A_{in} \text{ s.t. } a = (env, O_i.\sigma_i),\ (q', a, q) \in \delta \}$$

Thus, in object $O_i$ we keep the states that the GSA enters after receiving a
message from the environment, and the states from which $O_i$ sends messages.

Let $|\Sigma_{col}| = |Q|$ and let f be a one-to-one function

$$f : Q \to \Sigma_{col}$$

The transition relation $\delta_{O_i}$ for object $O_i$ is defined as follows:

If $(q, a, q') \in \delta$, $a = (env, O_i.\sigma_i)$, then $(q, \sigma_i / O_{i+1}.f(q'), q') \in \delta_{O_i}$.

If $(q, /a, q') \in \delta$, $a = (O_j, O_i.\sigma_i)$, then

either $q' \in Q_{O_j}$ and then $(q, /O_i.\sigma_i; O_{j+1}.f(q'), q') \in \delta_{O_j}$,

or $q' \notin Q_{O_j}$ and then $(q, /O_i.\sigma_i; O_{j+1}.f(q'), q_{idle}) \in \delta_{O_j}$.

If $q \in Q_{O_i}$, $q' \in Q_{O_i}$ then $(q, f(q'), q') \in \delta_{O_i}$.

If $q \in Q_{O_i}$, $q' \notin Q_{O_i}$ then $(q, f(q'), q_{idle}) \in \delta_{O_i}$.

For every $q \in Q_{O_i}$, $(q_{idle}, f(q), q) \in \delta_{O_i}$.
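
As a small illustration of which states an object retains under partial duplication, the following Python sketch computes the kept set for one object. The encoding of transitions, the function name `kept_states` and the GSA fragment are hypothetical and chosen only for this example; they follow the informal reading that an object keeps the states from which it sends a message and the states entered after an environment message addressed to it.

```python
def kept_states(gsa_transitions, obj):
    """States of the GSA that object `obj` keeps under partial duplication.

    Each transition is (q, sender, receiver, msg, q2); sender == "env"
    marks a message from the environment.
    """
    keep = set()
    for q, sender, receiver, msg, q2 in gsa_transitions:
        if sender == obj:
            keep.add(q)      # obj sends a message from q
        if sender == "env" and receiver == obj:
            keep.add(q2)     # GSA enters q2 after an environment message to obj
    return keep


# Hypothetical GSA fragment (not the subautomaton of Fig. 18).
gsa = [
    ("q0", "env", "car", "setDest", "q1"),
    ("q1", "car", "carHandler", "departReq", "q2"),
    ("q2", "carHandler", "car", "departAck", "q0"),
]
car_states = kept_states(gsa, "car")
handler_states = kept_states(gsa, "carHandler")
```

All states outside the kept set would be merged into the single idle state of that object.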



Fig. 21. Partial Duplication



This construction is illustrated in Fig. 21. The states of the GSA of Fig. 18
that were eliminated are $q_1$, $q_2$, $q_4$ and $q_5$ for carHandler, $q_0$ and $q_3$ for car, and
$q_0$, $q_2$, $q_3$ and $q_4$ for proxSensor.

5.4 Complexity Issues

In the previous sections we showed how to distribute the satisfying GSA between
the objects, to create an object system satisfying the LSC specification LS. We
now discuss the size of the resulting system, relative to that of LS.

We take the size of an LSC chart m to be

$$|m| = |dom(m)| = \#\text{ of locations in } m,$$

and the size of an LSC specification LS = (M, amsg, mod) to be

$$|LS| = \sum_{m \in M} |m|.$$

The size of the GSA $A = (Q, q_0, \delta)$ is simply the number of states |Q|. We ignore
the size of the transition relation $\delta$, which is polynomial in the number of states
|Q|. Similarly, the size of an object is the number of states in its state machine.

Let LS be a consistent specification, where the universal charts in M are
$m_1, m_2, \ldots, m_t$. Let A be the satisfying GSA derived using the algorithm for
deciding consistency (Algorithm 2). A was obtained by intersecting the automata
$A_1, A_2, \ldots, A_t$ that accept the runs of charts $m_1, m_2, \ldots, m_t$, respectively, and then
performing additional transformations that do not change the number of states
in A. The states of automaton $A_i$ correspond to the cuts through chart $m_i$, as
illustrated, for example, in Fig. 9.

Proposition 2. The number of cuts through a chart m with n instances is
bounded by $|m|^n$.

Proof. Omitted in this version of the paper.

This is consistent with estimates given in the literature for the
complexity of model checking for MSCs. We now estimate the size of the GSA.

Proposition 3. The size of the GSA A constructed in the proof given in the
Appendix satisfies

$$|A| \le \prod_{i=1}^{t} |m_i|^{n_i} \le |LS|^{nt},$$

where $n_i$ is the number of instances appearing in chart $m_i$, n is the total number
of instances appearing in LS, and t is the number of universal charts in LS.

Proof. Omitted in this version of the paper. 

The size of the GSA A is thus polynomial in the size of the specification 
LS, if we are willing to treat the number of objects in the system and the 
number of charts in the specification as constants. In some cases a more realistic 
assumption would be to fix one of these two, in which case the synthesized 
automaton would be exponential in the remaining one. The time complexity of
the synthesis algorithm is polynomial in the size of A. 

The size of the synthesized object system is determined by the size of the 
GSA A. In the controller object approach (Section 5.1), the controller object is 
of size |A| and each of the other objects is of constant size (one state). In the 
full duplication approach (Section 5.2), each of the objects is of size |A|, while 
in the partial duplication approach (Section 5.3), the size of each of the objects 
can be smaller than |A|, but the total size of the system is at least |A|. 



5.5 Synthesis without Fairness Assumptions 

We have shown that for a consistent specification we can find a GSA and then 
construct an object system that satisfies the specification. This construction 
used null transitions and a fairness assumption related to them, i.e., that a null 
transition that is enabled an infinite number of times is taken an infinite number 
of times. We now show that consistent specifications also have satisfying object 
systems with no null transitions. 

Let $A = (Q, q_0, \delta)$ be the GSA satisfying the specification LS, derived using
the algorithm for deciding consistency. We partition Q into two sets: $Q_{stable}$,
the states in Q that do not have outgoing null transitions, and $Q_{transient}$, the
states all of whose outgoing transitions are null transitions. Such a partition
is possible, as implied by the proof in the Appendix. Now, A may have a loop of null
transitions consisting of states in $Q_{transient}$. Such a loop represents an infinite
number of paths and it will not be possible to maintain all of them in a GSA
without null transitions. To overcome this, we create a new GSA $A' = (Q', q_0, \delta')$,
with $Q' \subseteq Q$ and $\delta' \subseteq \delta$, as follows.

Let $m \in M$ be an existential chart, mod(m) = existential. A satisfies the
specification LS, so there exists a word w, with $w \in \mathcal{L}_A \cap \mathcal{L}_m$. Let $q^0, q^1, \ldots$ be the
sequence of states that A goes through when generating w, and let $\delta^0, \delta^1, \ldots$ be the
transitions taken. Since w € £m, let ii , ..., ik, such that • • • Wi^, S 

Let j be the minimal index such that $j > i_k$ and $q^j \in Q_{stable}$. The new GSA
A' will retain all the states that appear in the sequence $q^0, \ldots, q^j$ and all the
transitions that were used along it. This is done for every existential chart
$m \in M$.

In addition, for every $q_i, q_j \in Q$ and for every $a \in A_{in}$, if there exists a
sequence of states $q_i, q^1, \ldots, q^l, q_j$ such that $(q_i, a, q^1) \in \delta$ and for every $1 \le k < l$
there is a null transition $\delta^k \in \delta$ between $q^k$ and $q^{k+1}$, then for one such sequence
we keep in A' the states $q^1, \ldots, q^l$ and the transitions connecting them.

All other states and transitions of A are eliminated in going from A to A'.

Proposition 4. The GSA A' satisfies the specification LS. 

Proof. Omitted in this version of the paper. 

6 Synthesizing Statecharts

We now outline an approach for a synthesis algorithm that uses the main suc- 
cinctness feature of statecharts p87j (see also [Hnnz]) , namely, concurrency, via 
orthogonal states. 

Consider a consistent specification LS = (M, amsg, mod), where the universal
charts in M are $M_{universal} = \{m_1, m_2, \ldots, m_t\}$. In the synthesized object
system each object $O_i$ will have a top-level AND state with t orthogonal component
OR states, $s_1, s_2, \ldots, s_t$. Each $s_j$ has substates corresponding to the locations
of object $O_i$ in chart $m_j$.

Assume that in a scenario described by one of the charts in $M_{universal}$, object
$O_i$ has to send message $\sigma$ to object $O_j$. If object $O_i$ is in a state describing a
location just before this sending, $O_i$ will check whether $O_j$ is in a state corresponding
to the right location, and is ready to receive. (This can be done using one of the
mechanisms of statecharts for sensing another object's state.) The message $\sigma$
can then be sent and the transition taken. All the other component states of $O_i$
and $O_j$ will synchronously take the corresponding transitions if necessary.

This was a description of the local check that an object has to perform before 
sending a message and advancing to the next location for one chart. Actually, 
the story is more complicated, since when advancing in one chart we must check 
that this does not contradict anything in any of the other charts. Even more 
significantly, we also must check that taking the transition will not get the system 
into a situation in which it will not be able to satisfy one of the universal charts. 

To deal with these issues the synthesis algorithm will have to figure out which 
state configurations should be avoided. Specifically, let $c_i$ be a cut through chart
$m_i$. We say that $C = (c_1, c_2, \ldots, c_t)$ is a supercut if for every i, $c_i$ is a cut
through $m_i$. We say that supercut $C' = (c'_1, c'_2, \ldots, c'_t)$ is a successor of supercut
$C = (c_1, c_2, \ldots, c_t)$ if there exists i with $succ_{m_i}(c_i, (j, l_j), c'_i)$ and such that for all
$k \neq i$ the cut $c'_k$ is consistent with communicating the message $msg(j, l_j)$ while
in cut $c_k$.

Now, for i = 0, 1, ..., define the sets

$Bad_i \subseteq$ {all supercuts s.t. at least one of the cuts has at least one hot location}

as follows:

$$Bad_0 = \{ C \mid C \text{ has no successors} \}$$

$$Bad_i = \{ C \mid C \in Bad_{i-1} \text{ or all successors of } C \text{ are in } Bad_{i-1} \}$$

The series $Bad_i$ is monotonically increasing under set inclusion, so that
$Bad_i \subseteq Bad_{i+1}$. Since the set of all supercuts is finite the series converges.
Denote its limit by $Bad_{max}$. The point now is that before taking a transition the
synthesized object system will have to check that it does not lead to a supercut
in $Bad_{max}$.
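
The iteration over supercuts can be sketched as a simple fixpoint on a successor graph. In the Python fragment below the supercuts are opaque labels, and the successor map, the set of hot supercuts and the function name `bad_supercuts` are hypothetical; the sketch only illustrates how the limit set would be used to filter transitions.

```python
def bad_supercuts(successors, hot):
    """Limit of the Bad_i series, restricted to supercuts with a hot location.

    `successors` maps each supercut to the list of its successor supercuts;
    `hot` is the set of supercuts in which some cut has a hot location.
    """
    # Bad_0: hot supercuts with no successors at all.
    bad = {c for c in hot if not successors.get(c)}
    while True:
        # Bad_i: previous set plus hot supercuts whose successors are all bad.
        grown = bad | {c for c in hot
                       if all(n in bad for n in successors.get(c, ()))}
        if grown == bad:
            return bad
        bad = grown


# Hypothetical successor graph over four supercuts.
successors = {"C0": ["C1", "C2"], "C1": ["C3"], "C2": ["C0"], "C3": []}
hot = {"C1", "C3"}
bad = bad_supercuts(successors, hot)
# Transitions out of C0 that the synthesized system may still take.
safe_from_c0 = [c for c in successors["C0"] if c not in bad]
```

Here C3 is stuck in a hot configuration, C1 can only move to C3, and so the system in C0 must avoid the move to C1.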

The construction is illustrated in Figs. 22, 23, 24 and 25, which show the
statecharts for car, carHandler, proxSensor and cruiser, respectively, obtained
from the railcar system specification. Notice that an object that does not actively
participate in some universal chart does not need a component in its statechart
for this scenario; for example, proxSensor does not have a Perform Departure
component. Notice also the use of the in predicate in the statechart of proxSensor
for sensing whether the car is in the stop state.

Fig. 22. Statechart of car





Fig. 23. Statechart of carHandler 



The number of states in this kind of synthesized statechart-based system is on
the order of the total number of locations in the charts of the specification. Now,
although in the GSA solution the number of states was exponential in the number
of universal charts and in the number of objects in the system, which seems to
contrast sharply with the situation here, the comparison is misleading; the guards
of the transitions here may involve lengthy conditions on the states of the system.
In practice, it may prove useful to use OBDDs for efficient representation and
manipulation of conditions over the system state space.

Fig. 24. Statechart of proxSensor

Fig. 25. Statechart of cruiser

APPENDIX: Proof of the Theorem (Consistency <=> Satisfiability)

Proof. The proof relies on the definitions of an object system appearing in 
||HKp55| , somewhat modified. In |HKp55| a basic computational model for 
object-oriented designs is presented. It defines the behavior of systems com- 
posed of instances of object classes, whose behavior is given by conventional 
state machines. In our work we assume a single instance of each class during 
the entire evolution of the system — we do not deal with dynamic creation and 
destruction of instances. We assume all messages are synchronous and that there 
are no failures in the system — every message that is sent is received. 

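(=>) Let the object system S be such that $S \models LS$. We let $\mathcal{L}_S = \bigcup_\eta \mathcal{L}_S^\eta$
and show that $\mathcal{L}_S$ satisfies the four requirements of $\mathcal{L}_1$ in the definition of a
consistent specification.

(1) From the definition of an object system it follows that $\mathcal{L}_S$ is regular and
nonempty. The system S satisfies the specification LS. Hence, if we set
$\mathcal{L} = \bigcap_{m_i \in M,\ mod(m_i) = universal} \mathcal{L}_{m_i}$, Clause 1 of the definition of satisfaction
implies $\forall \eta\ \mathcal{L}_S^\eta \subseteq \mathcal{L}$. Thus, $\mathcal{L}_S = \bigcup_\eta \mathcal{L}_S^\eta \subseteq \mathcal{L}$.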

(2 and 3) Let $w \in \mathcal{L}_S$. There exists a sequence of directed requests sent by
the environment, $\eta = O^0.\sigma^0 \cdot O^1.\sigma^1 \cdots O^n.\sigma^n$, such that w is the behavior of the
system S while reacting to the sequence of requests $\eta$. Now, w belongs to the
trace set of S on $\eta$, so that $w = w_0 \cdot w_1 \cdots w_n$, $w_i \in A^*$, $first(w_i) = (env, O^i.\sigma^i)$,
and there exists a sequence of stable configurations $c_0, c_1, \ldots, c_{n+1}$ such that $c_0$
is initial and for all $0 \le i \le n$, $leads(c_i, w_i, c_{i+1})$. The leads predicate is defined
in |HKp55| . It describes the reaction of the system to a message sent from the
environment to the system that causes a transition of the system from the stable
configuration $c_i$ to a new stable configuration $c_{i+1}$, passing through a set of
unstable configurations. The trace describing this behavior is $w_i$.

As to Clause 2 in the definition of consistency, the system reaches the stable
configuration $c_{n+1}$ at the end of the reactions to $\eta$. For any object $O_i$ and
request $\sigma$, there is a reaction of the system to the directed request $O_i.\sigma$ from the
stable configuration $c_{n+1}$. If we denote by $w_{n+1}$ the word that captures such a
reaction, $w_{n+1}$ is in the trace set of S on $O_i.\sigma$ from $c_{n+1}$, from which we obtain
$w \cdot w_{n+1} \in \mathcal{L}_S$.

For Clause 3, assuming that $w = x \cdot y \cdot z$, $y \in A_{in}$, there exists i with
$x = w_0 \cdots w_i$ and therefore $x \in \mathcal{L}_S$.

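(4) The system S satisfies the specification LS. Hence, from Clause 2 of the
definition of satisfaction we have

$$\forall m \in M,\ mod(m) = existential \Rightarrow \exists \eta\ \ \mathcal{L}_S^\eta \cap \mathcal{L}_m \neq \emptyset.$$

Since $\mathcal{L}_S^\eta \subseteq \mathcal{L}_S = \bigcup_\eta \mathcal{L}_S^\eta$, we obtain

$$\forall m \in M,\ mod(m) = existential \Rightarrow \mathcal{L}_S \cap \mathcal{L}_m \neq \emptyset.$$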

(<=) Let LS be consistent. We have to show that there exists an object
system S satisfying LS. To prove this we define the notion of a global system 
automaton, or a GSA. We will show that there exists a GSA satisfying the 
specification and that it can be used to construct an object system satisfying 
LS. 

A GSA A describing a system with objects $O = \{O_1, \ldots, O_n\}$ and message
set $\Sigma = \Sigma_{in} \cup \Sigma_{out}$ is a tuple $A = (Q, q_0, \delta)$, where Q is a finite set of states, $q_0$
is the initial state, and $\delta \subseteq Q \times B \times Q$ is a transition relation. Here B is a
set of labels, each one of the form $\sigma / \tau$, where $\sigma \in A_{in} = (env) \times (O.\Sigma_{in})$ and
$\tau \in A_{out}^* = ((O) \times (O.\Sigma_{out}))^*$. Let $\eta = a^0 \cdot a^1 \cdots$ where $a^i \in A_{in}$. The trace set
of A on $\eta$ is the language $\mathcal{L}_A^\eta \subseteq (A^* \cup A^\omega)$, such that a word $w = w_0 \cdot w_1 \cdot w_2 \cdots$
is in iff Wi = G A*^^ and there exists a sequence 

of states with g°^ = go, and such that for all i,j 

(g®A ^ and (g**^i , ^ 

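The satisfaction relation between a GSA and an LSC specification is defined
as for object systems: the GSA A satisfies LS = (M, amsg, mod), written $A \models LS$,
if $\forall m \in M$, $mod(m) = universal \Rightarrow \forall \eta\ \mathcal{L}_A^\eta \subseteq \mathcal{L}_m$, and $\forall m \in M$,
$mod(m) = existential \Rightarrow \exists \eta\ \mathcal{L}_A^\eta \cap \mathcal{L}_m \neq \emptyset$.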

Since LS is consistent, there exists a language $\mathcal{L}_1$ as in the definition of
consistency. Since $\mathcal{L}_1$ is regular, there exists a DFA $\mathcal{A} = (A, S, s_0, \rho, F)$ accepting it.
We may assume that $\mathcal{A}$ is minimal, so all states in S are reachable and each state
leads to some accepting state.

From Clause 2 of the definition of consistency, for every accepting state s of
$\mathcal{A}$ and for every $a \in A_{in}$ there exists an outgoing transition with label a leading
to a state that is connected to an accepting state by a path labeled by a word in
$A_{out}^*$. Formally,

$$\forall s \in F\ \ \forall a \in A_{in}:\ \rho(s, a) = s' \Rightarrow \exists r \in A_{out}^* \text{ s.t. } \rho(s', r) \in F.$$

From Clause 3 of the definition of consistency, no nonaccepting states of $\mathcal{A}$
have any outgoing transitions with label $a \in A_{in}$. This is true since if there were
such a state $s \notin F$ reachable from the initial state by x, we would have
$\rho(s_0, x) = s$ and $\rho(s, a) = s'$, and from s' we can reach an accepting state
$\rho(s', z) \in F$. Then $w = x \cdot a \cdot z$ would violate Clause 3.

We have shown that $\mathcal{A}$ has transitions labeled by $A_{in}$ only for accepting
states, and for an accepting state there is such a transition for every letter
from $A_{in}$. We now convert $\mathcal{A}$ into an NFA $\mathcal{A}'$ with the same properties, but, in
addition, accepting states do not have outgoing transitions labeled by $A_{out}$. This
can be done by adding, for each state $s \in F$, an additional state $s' \notin F$. All
incoming transitions into s are duplicated so that they also enter s' and all
outgoing transitions from s labeled by $A_{out}$ are transferred to s'. $\mathcal{A}'$ accepts the
same language as $\mathcal{A}$ since it can use nondeterminism to decide whether to take a
transition to s or s'.

We now transform the automaton $\mathcal{A}'$ into a GSA B by changing all transitions
with a label from $A_{out}$ into null transitions with that letter as an action. All
transitions with a label from $A_{in}$ are left unchanged.

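We have to show that B satisfies the specification LS. From the construction
of B, we have

$$\mathcal{L}_B = \bigcup_\eta \mathcal{L}_B^\eta = \mathcal{L}_1.$$

Hence,

$$\forall m \in M,\ mod(m) = universal \Rightarrow \mathcal{L}_1 \subseteq \mathcal{L}_m,$$

yielding

$$\forall m \in M,\ mod(m) = universal \Rightarrow \forall \eta\ \ \mathcal{L}_B^\eta \subseteq \mathcal{L}_m.$$

This proves Clause 1 of the definition of satisfaction.
Now, from Clause 4 of the definition of consistency we have

$$\forall m \in M,\ mod(m) = existential \Rightarrow \mathcal{L}_m \cap \mathcal{L}_1 \neq \emptyset.$$

But since $\mathcal{L}_1 = \bigcup_\eta \mathcal{L}_B^\eta$ this becomes

$$\forall m \in M,\ mod(m) = existential \Rightarrow \exists \eta\ \ \mathcal{L}_B^\eta \cap \mathcal{L}_m \neq \emptyset,$$

thus proving Clause 2 of the definition of satisfaction.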

Abstract. This paper is a review of some of the major applications
of finite-state transducers in Natural Language Processing ranging from 
morphological analysis to finite-state parsing. The analysis and gener- 
ation of inflected word forms can be performed efficiently by means of 
lexical transducers. Such transducers can be compiled using an extended 
regular expression calculus with restriction and replacement operators. 
These operators facilitate the description of complex linguistic phenom- 
ena involving morphological alternations and syntactic patterns. Because 
regular languages and relations can be encoded as finite-automata, new 
languages and relations can be derived from them directly by the finite- 
state calculus. This is a fundamental advantage over higher-level linguis- 
tic formalisms. 



1 Introduction 

The last decade has seen a substantial surge in the use of finite-state methods 
in many areas of natural language processing. This is a remarkable comeback 
considering that in the dawn of modern linguistics, finite-state grammars were 
dismissed as fundamentally inadequate. Noam Chomsky’s seminal 1957 work, 
Syntactic Structures jS|, includes a short chapter devoted to “finite state Markov 
processes” , devices that we now would call weighted finite-state automata. In this 
section Chomsky demonstrates in a few paragraphs that 

English is not a finite state language, (p. 21) 

In any natural language, a sentence may contain discontinuous constituents
embedded in the middle of another discontinuous pair as in “If_1 ... either_2 ... or_2
... then_1 ...” It is impossible to construct a finite automaton that keeps track
of an unlimited number of such nested dependencies. Any finite-state machine
for English will accept strings that are not well-formed.

The persuasiveness of Syntactic Structures had the effect that, for many 
decades to come, computational linguists directed their efforts towards more 
powerful formalisms. Finite-state automata as well as statistical approaches dis- 
appeared from the scene for a long time. Today the situation has changed in 
a fundamental way: statistical language models are back and so are finite-state 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 34-gHl 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 
automata, in particular, finite-state transducers. One reason is that there is a
certain disillusionment with high-level grammar formalisms. Writing large-scale
grammars even for well-studied languages such as English turned out to be a
very hard task. With easy access to text in electronic form, the lack of robustness
and poor coverage became frustrating. But there are other, more positive reasons
for the renewed interest in finite-state techniques. In phonology, it
was discovered rather early that the kind of formal descriptions of phonological
alternations used by linguists were, against all appearances, finite-state models.
In syntax, it became evident that although English as a whole is not a
finite-state language, there are nevertheless subsets of English for which a finite-state
description is not only adequate but also easier to construct than an equivalent
phrase-structure grammar. Finally, considerable progress has been made in
developing special finite-state formalisms that are suited for the description of
linguistic phenomena and, along with them, compilers that efficiently produce
automata from such a description. The automata in current linguistic applications
are typically much too large and complex to be produced by hand.

The following sections will cover these positive developments in more detail. 

2 Finite-State Morphology 

Morphology is a domain of linguistics that studies the formation of words. It is 
traditional to distinguish between surface forms and their analyses, called lem- 
mas. The lemma for a surface form such as the English word bigger typically 
consists of the traditional dictionary citation form of the word together with 
terms that convey the morphological properties of the particular form. For ex- 
ample, the lemma for bigger might be represented as big+Adj+Comp to indicate 
that bigger is the comparative form of the adjective big. 

There are two challenges in modeling natural-language morphology: 

1. Morphotactics 

Words are typically composed of smaller units of meaning, called morphemes. 
The morphemes that make up a word must be combined in a certain order: 
piti-less-ness is a word of English but *piti-ness-less is not. Most 
languages build words by concatenation but some languages also exhibit 
non-concatenative processes such as interdigitation and reduplication. 

2. Morphological Alternations 

The shape of a morpheme often depends on the environment: pity is realized 
as piti in the context of less, die as dy in dying. 

The basic claim of the finite-state approach to morphology is that the relation 
between the surface forms of a language and their corresponding lemmas can be 
modeled as a regular relation. If the relation is regular, it can be defined 
using the metalanguage of regular expressions; and, with a suitable compiler, 
the regular expression source code can be compiled into a finite-state transducer 
that implements the relation computationally. 



^ Some writers prefer the term rational relation. 

In the resulting transducer, each path (= sequence of states and arcs) from 
the initial to a final state represents a mapping between a surface form and its 
lemma, also known as the lexical form. For example, the information that the 
comparative of the adjective big is bigger might be represented in the English 
lexical transducer by the path in Figure 1, where the zeros represent epsilon 
symbols. 




Lexical side:   b   i   g   0   +Adj   0   +Comp 
Surface side:   b   i   g   g   0      e   r 

Fig. 1. A Path in a Transducer for English 



For the sake of clarity, Figure 1 represents the upper and the lower side of the 
arc label separately on the opposite sides of the arc. In the rest of the paper, 
we use a more compact notation: the upper and the lower symbol are combined 
into a single label of the form upper:lower if the symbols are distinct. Identity 
pairs, e.g. b:b, are reduced to a single symbol. In standard notation, the path 
in Figure 1 is labeled as 

b i g 0:g +Adj:0 0:e +Comp:r. 

An important characteristic of the finite-state transducers built at Xerox 
is that they are inherently bidirectional: there is no privileged input side. The 
path in Figure 1 can be traversed matching either the form bigger to produce 
big+Adj+Comp, or vice versa. The same transducer can be used for analysis (sur- 
face input, “upward” application) or for generation (lexical input, “downward” 
application). 
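
The bidirectional use of a single path can be sketched in a few lines of Python. This is only an illustration, not the Xerox implementation: the list `PATH` and the helpers `side`, `apply_down` and `apply_up` are hypothetical names, and a real lexical transducer is a network of many paths rather than one list of symbol pairs.

```python
# Hypothetical encoding of the path of Figure 1 as (upper, lower) symbol pairs.
# "0" stands for the epsilon symbol, as in the notation used in the text.
PATH = [("b", "b"), ("i", "i"), ("g", "g"), ("0", "g"),
        ("+Adj", "0"), ("0", "e"), ("+Comp", "r")]


def side(path, index):
    """Concatenate the upper (index 0) or lower (index 1) side, skipping epsilons."""
    return "".join(pair[index] for pair in path if pair[index] != "0")


def apply_down(path, lexical):
    """Generation: return the surface form if the upper side matches, else None."""
    return side(path, 1) if side(path, 0) == lexical else None


def apply_up(path, surface):
    """Analysis: return the lexical form if the lower side matches, else None."""
    return side(path, 0) if side(path, 1) == surface else None
```

The same data structure serves both directions; only the side that is matched against the input changes.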

A single surface string can be related to multiple lexical strings. For exam- 
ple, a morphological transducer for French applied upward to the surface string 
suis may produce the four lexical strings shown in Figure 2. Ambiguity in 
the downward direction is also possible, as in the relation of the lexical string 
payer+IndP+SG+P1+Verb (“I pay”) to the surface strings paie and paye, which 
are in fact alternate spellings in standard French orthography. 

etre+IndP+SG+P1+Verb 
suivre+IndP+SG+P2+Verb 
suivre+IndP+SG+P1+Verb 
suivre+Imp+SG+P2+Verb          payer+IndP+SG+P1+Verb 
          |                              | 
            Lexical Transducer for French 
          |                              | 
        suis                           paie 
                                       paye 

Fig. 2. Morphological Ambiguities 

^ The epsilon symbols and their placement in the string are not significant. We will 
ignore them whenever it is convenient. 

At Xerox, such lexical transducers have been created for a great number 
of languages including most of the European languages, Turkish, Arabic, Ko- 
rean, and Japanese. The source descriptions are written using notations 
that are helpful shorthands for ordinary regular expressions. The construc- 
tion is commonly done by creating two separate modules: a lexicon description 
that defines the morphotactics of the language and a set of rules that define 
regular alternations such as the gemination of g and the epenthetical e in the 
surface form bigger. Irregular alternations such as etre:suis are defined ex- 
plicitly in the source lexicon. The separately compiled lexicon and rule networks 
are subsequently composed together as in Figure 3. 




Fig. 3. Creation of a Lexical Transducer 



Lexical transducers may contain hundreds of thousands, even millions, of 
states and arcs and an infinite number of paths in the case of languages such 
as German that in principle allow noun compounds of any length. The regular 
expressions from which such complex networks are compiled include high-level 
operators that have been developed in order to make it possible to describe 
constraints and alternations that are commonly found in natural languages in a 
convenient and perspicuous way. We will describe them in the following sections. 



3 Basic Expression Calculus 

The notation used in this section comes from the Xerox finite-state calculus. 
It is described in detail in Chapter 2 of the forthcoming book by Beesley and 
Karttunen. We use uppercase letters, A, B, etc., as variables over regular 
expressions. Lower case letters, a, b, etc., stand for symbols. There are two 
special symbols: 0, the epsilon symbol, that stands for the empty string and ?, 
the ANY symbol that represents the infinite set of symbols in some yet unknown 
alphabet. The special meaning of 0, ?, and any other symbol can be canceled 
by enclosing the symbol in double quotes. 

An atomic expression consisting of a symbol pair such as a:b denotes a 
relation containing the corresponding strings. An expression consisting of a single 
symbol such as a denotes the language consisting of “a” or, alternatively, the 
corresponding identity relation. The Xerox implementation intentionally does 
not distinguish between a and a:a. 

Complex regular expressions can be built up from simpler ones by means 
of regular expression operators. Square brackets, [ ], are used for grouping ex- 
pressions. Because both regular languages and regular relations are closed under 
concatenation and union, the following basic operators can be combined with 
any kind of regular expression: 

A | B    Union. 
A B      Concatenation. 
(A)      Optionality; equivalent to [A | 0]. 
A+       Iteration; one or more concatenations of A. 
A*       Kleene star; equivalent to (A+). 

Although regular languages are closed under complementation and intersection, 
regular relations are not; thus the following operators can be combined only 
with expressions that denote a regular language. 

~A Complement 

\A Term complement; all single symbol strings not in A. 

A & B Intersection 
A - B Subtraction (minus) 

Regular relations can be constructed by means of two basic operators: 

A .x. B    Crossproduct 
A .o. B    Composition 

The crossproduct operator, .x., is used only with expressions that denote a 
regular language; it constructs a relation between them. [A .x. B] designates 
the relation that maps every string of A to every string of B. 
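
For finite languages and relations, the two operators can be illustrated directly on Python sets. This is a sketch under a strong simplifying assumption: languages are finite sets of strings and relations are finite sets of string pairs, whereas the calculus operates on (possibly infinite) regular sets encoded as networks. The function names `crossproduct` and `compose` are chosen here for illustration.

```python
from itertools import product


def crossproduct(lang_a, lang_b):
    """Relation that maps every string of lang_a to every string of lang_b."""
    return set(product(lang_a, lang_b))


def compose(rel_1, rel_2):
    """Pairs (u, w) such that rel_1 maps u to some v and rel_2 maps v to w."""
    return {(u, w) for (u, v) in rel_1 for (v2, w) in rel_2 if v == v2}
```

For instance, composing `{("a", "b")}` with `{("b", "c")}` gives `{("a", "c")}`: the intermediate string "b" disappears from the result.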



4 Containment, Restriction, Replacement, and Marking 

The syntax (though not the descriptive power) of regular expressions can be 
extended by defining new operators that allow commonly used constructions 
to be expressed more concisely. A simple example of a trivial but convenient 
extension is the containment operator $. 

$A =def [?* A ?*] 

For example, $[a | b] denotes all strings that contain at least one “a” or 
“b” somewhere. 

The addition of new operators can be more than just a notational conve- 
nience. A case in point is Koskenniemi’s restriction operator =>. 

A => L _ R    Restriction; A only in the context of L _ R. 

Here A, L and R may denote any regular language. This expression designates 
the language of strings that have the property that any string of A that occurs in 
them is immediately preceded by some string from L and immediately followed 
by some string from R. For example, a => b _ c includes all strings that contain 
no occurrence of “a”, strings like “bac-bac” that completely satisfy the condition, 
but no strings like “ab”. A special boundary symbol, .#., is used to indicate the 
beginning or the end of the string. For example, a => _ .#. allows “a” only at 
the end of a string. 

The advantage of the restriction operator is that it encodes in a compact 
way a useful condition that is difficult to express in terms of the more primitive 
operators. The definition of [A => L _ R] is shown below. 

A => L _ R =def ~[ [~[?* L] A ?*] | [?* A ~[R ?*]] ] 
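
The condition encoded by the restriction operator can be checked on individual strings with a short Python function. The sketch below assumes that A, L and R are each a single fixed string rather than an arbitrary regular language, and it tests membership of one string instead of building an automaton; the name `satisfies_restriction` is hypothetical.

```python
def satisfies_restriction(text, a, left, right):
    """True if every occurrence of `a` in `text` is immediately preceded by
    `left` and immediately followed by `right` (a => left _ right)."""
    start = text.find(a)
    while start != -1:
        end = start + len(a)
        if not (text[:start].endswith(left) and text[end:].startswith(right)):
            return False
        start = text.find(a, start + 1)
    return True
```

Strings without any occurrence of `a` satisfy the restriction vacuously, which matches the description of a => b _ c above.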

Another example of a useful high-level abstraction is the replace operator 
->. As we will see shortly, there are many constructions involving this operator. 
The simplest variant is unconstrained, obligatory replacement: 

A -> B Replacement of A by B. 

Transducers compiled from -> expressions are usually intended to be applied 
downward; they can of course be inverted and applied in the other direction. 
The component expressions, A and B, must denote regular languages but the 
expression as a whole denotes a relation. The [A -> B] relation maps any upper- 
language string to itself if the string contains no instance of A. Upper language 
strings that contain instances of A are paired with lower-language strings that 
are otherwise identical except that each A segment is replaced by some B string. 
The definition of simple replacement is shown below. 

A -> B =def [ [~$[A - 0] [A .x. B]]* ~$[A - 0] ] 

Two replace expressions linked with a comma indicate parallel replacement. For 
example, 

a -> b, b -> a 

yields a transducer that exchanges the two letters, mapping “abba” to “baab”. 
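
The difference between parallel replacement and two replacements applied one after the other can be seen in a small Python sketch. It uses the standard `str.translate` and `str.replace` methods as stand-ins for the transducers; the function names are illustrative only.

```python
# Parallel replacement: both mappings are applied to the original input at once.
SWAP = str.maketrans({"a": "b", "b": "a"})


def parallel_swap(text):
    return text.translate(SWAP)


def sequential_swap(text):
    """Applies a -> b first and b -> a to the result, which is NOT parallel."""
    return text.replace("a", "b").replace("b", "a")
```

With the sequential version, the output of the first replacement is fed to the second, so all letters of "abba" end up as "a".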

High-level abstractions like A => L _ R and A -> B are conceptually easier 
to operate with than the logically equivalent but very complex primitive for- 
mulas, just as it is easier to write complex computer programs in a high-level 
language rather than in a logically equivalent assembly language. 

Instead of replacing the strings of a language by other strings, it is sometimes 
useful just to mark them in some special way. In the Xerox calculus, an expression 
of the form 



A -> B ... C    Marking A by B and C. 

yields a transducer that maps any upper-language string to a lower-language 
string that is identical to it except that any instance of A is preceded by a string 
from B and followed by a string from C. Here A, B and C may denote any regular 
language. In practice, however, B and C are usually atomic expressions. For ex- 
ample, a | e | i | o | u -> "[" ... "]" yields a transducer that encloses 
vowels between square brackets leaving the rest of the text unchanged. The 
relation includes pairs such as 

i c e c r e a m 
[i] c [e] c r [e] [a] m 
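
For single-symbol patterns such as the vowels, the effect of this marking transducer can be imitated with Python's standard `re.sub`. This is only an analogy: `re` applies a backtracking matcher to one string, whereas the calculus compiles the expression into a transducer. The function name `mark_vowels` is an assumption of this sketch.

```python
import re


def mark_vowels(text):
    """Enclose every vowel in square brackets, leaving other symbols unchanged."""
    return re.sub(r"[aeiou]", lambda m: "[" + m.group(0) + "]", text)
```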



4.1 Constraining Replacement and Marking 



Replacement and marking can be constrained in many different ways: by con- 
text, by direction of application, by length of the pattern to be replaced or 
marked. The basic technique for compiling constrained replacement and mark- 
ing transducers was invented in the early 1980s by Kaplan and Kay for 
Chomsky-Halle-type rewrite rules. It was also used very early for Kosken- 
niemi’s two-level rules. The idea was finally explained in detail in 
Kaplan and Kay’s 1994 paper. There is now a rich literature on this topic. 
The details vary but the basic method involves composing 
together a cascade of networks that introduce various auxiliary symbols into 
the input string, constrain their distribution, and finally eliminate the auxiliary 
alphabet. As there is no space to explore the compilation issue in a technical 
way, we will only explain the syntax of constrained replacement and marking 
expressions and give a few examples of the corresponding transducers without 
explaining how the expressions are compiled. 

The transducers compiled from the simple replacement and marking expres- 
sions are in general ambiguous in the sense that a string in the upper language 
of the relation is paired with more than one lower-language string. For example, 
a | a a -> "[" ... "]" yields a marking transducer that maps the upper 
language string “aaa” into three different lower-language strings: 

[a] [a] [a] 
[a] [a a] 
[a a] [a] 



The -> operator does not constrain the selection of the alternate substrings for 
replacement or marking. In this case, the upper language string can be factored 
or parsed in three different ways. 

For many applications, it is useful to define another version of replacement 
and marking that in all such cases yields a unique outcome. The longest-match, 
left-to-right replace operator, @->, imposes a unique factoriza- 
tion on every input. The upper language substrings to be marked or replaced 
are selected from left to right, not allowing any overlaps. If there are alternate 
candidate strings starting at the same location, only the longest one is chosen. 
Thus a | a a @-> "[" ... "]" denotes a relation that unambiguously maps 
“aaa” to “[aa][a]”. The transducers corresponding to the -> and @-> variants of 
this expression are shown in Figure 4. 





Fig. 4. An Ambiguous and an Unambiguous Marking Transducer 
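
The contrast between the ambiguous -> marking and the left-to-right, longest-match @-> marking can be reproduced on single strings in Python. The sketch assumes a finite list of non-empty literal patterns and enumerates outputs by brute force instead of compiling a transducer; `markings` and `longest_match_marking` are hypothetical names.

```python
def markings(text, patterns):
    """All outputs of obligatory marking: unmarked stretches may not contain
    any instance of a pattern, but the factorization is otherwise free."""
    results = set()

    def walk(i, out, gap):
        if any(p in gap for p in patterns):   # an instance was left unmarked
            return
        if i == len(text):
            results.add(out)
            return
        for p in patterns:
            if text.startswith(p, i):
                walk(i + len(p), out + "[" + p + "]", "")
        walk(i + 1, out + text[i], gap + text[i])

    walk(0, "", "")
    return results


def longest_match_marking(text, patterns):
    """Left-to-right, longest-match marking: exactly one output per input."""
    out, i = [], 0
    while i < len(text):
        candidates = [p for p in patterns if text.startswith(p, i)]
        if candidates:
            best = max(candidates, key=len)
            out.append("[" + best + "]")
            i += len(best)
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

For the input "aaa" with the patterns "a" and "aa", the first function returns the three factorizations listed above and the second returns only "[aa][a]".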



Replacement and marking contexts can be specified using the same notation as 
for restriction: L _ R, where L is the left context, R is the right context, and _ 
marks the site of the upper language string that is replaced or marked. In the case 
of a restriction expression, the interpretation of context is self-evident because 
a restriction denotes a set of strings. This is not the case for replacement and 
marking. Replacement and marking expressions must specify whether L and R 
pertain to the upper or the lower side of the relation. The Xerox calculus provides 
specific markers ||, //, \\ and \/ to distinguish between the four possible cases: 

|| L _ R    L and R both on the upper side 
// L _ R    L on the lower, R on the upper side 
\\ L _ R    L on the upper, R on the lower side 
\/ L _ R    L and R both on the lower side 



To see the difference between, say, the || and // versions, let us consider two variants 
of a phonological rule that shortens a double “aa” in the context of another 
double “aa” in the preceding syllable. Here C represents any consonant. 

Rule 1.   a a -> a || a a C+ _    (Slovak) 

Rule 2.   a a -> a // a a C+ _    (Gidabal) 

Vowel shortening is a very common type of morphological alternation under 
many different kinds of context conditions. Interestingly, in some languages such 
as Slovak the shortening depends on the lexical (upper side) context whereas 
in languages such as Gidabal (an Australian language), it is conditioned by the 
surface side. The hypothetical lexical form “baacaadaafaa” would be realized 
quite differently in these two languages: 

baacaadaafaa      baacaadaafaa 
baaca da fa       baaca daafa 
   Rule 1            Rule 2 

^ The symbol ? in an arc label represents an unknown symbol; in this case, any 
symbol other than [, ] , and a. By convention, the leftmost state is the start state, 
final states are indicated by double circles. 

This example is due to Martin Kay (p.c.). 

In a language like Slovak, the last three syllables would all shorten yielding 
“baacadafa” whereas a language like Gidabal would show the alternating pattern 
“baacadaafa”. 

The two replacement transducers compiled from Rule 1 and Rule 2 are shown 
in Figure 5. 





Fig. 5. Two Vowel-Shortening Rules 
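
The effect of checking the left context on the upper side (Rule 1) or on the lower side (Rule 2) can be simulated by a left-to-right scan in Python. This is a sketch, not a compilation of the rules: it assumes lowercase input, treats every non-vowel as a consonant C, and ignores overlapping “aaa” sequences. The name `shorten` and its `context_side` parameter are hypothetical.

```python
import re

# "a a C+" immediately before the replacement site.
LEFT_CONTEXT = re.compile(r"aa[^aeiou]+$")


def shorten(lexical, context_side):
    """Shorten "aa" to "a" after "aa C+", with the left context read from the
    lexical string ("upper") or from the output produced so far ("lower")."""
    out, i = [], 0
    while i < len(lexical):
        if lexical.startswith("aa", i):
            seen = lexical[:i] if context_side == "upper" else "".join(out)
            out.append("a" if LEFT_CONTEXT.search(seen) else "aa")
            i += 2
        else:
            out.append(lexical[i])
            i += 1
    return "".join(out)
```

With the lower-side context, a vowel that has just been shortened no longer licenses shortening in the next syllable, which produces the alternating pattern.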



Contextual constraints may be combined with the directional left-to-right 
and longest match constraints. For example, if C and V stand for consonants 
and vowels, respectively, a simple syllabification rule may be expressed in the 
following way: 

C* V+ C* @-> ... "-" || _ C V 

This marking expression yields an unambiguous transducer that inserts a hyphen 
after each longest available instance of the C* V+ C* pattern that is followed by 
a consonant and vowel. The relation it encodes consists of pairs of strings such 
as 

strukturalismi 
struk-tu-ra-lis-mi 

In this case, the choice between || and // makes no difference but the two other 
context markers, \\ and \/, could not be used here. 
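
A rough Python counterpart of the syllabification rule uses a lookahead for the right context. It is an approximation under stated assumptions: words are lowercase, V is one of a, e, i, o, u, every other character counts as C, and the greedy backtracking of Python's `re` module is relied on to give the longest match here. The names `SYLLABLE` and `syllabify` are illustrative.

```python
import re

# C* V+ C* followed by (but not including) a consonant and a vowel.
SYLLABLE = re.compile(r"[^aeiou]*[aeiou]+[^aeiou]*(?=[^aeiou][aeiou])")


def syllabify(word):
    """Insert a hyphen after each matched C* V+ C* stretch."""
    return SYLLABLE.sub(lambda m: m.group(0) + "-", word)
```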

The syllabification transducer is a simple finite-state parser: it recognizes and 
marks instances of a regular language in a text. In the next section we will show 
a more sophisticated example of this kind. 



5 Finite-State Syntax 

Although the syntax of a natural language cannot in general be described by a 
finite-state, or even a context-free, grammar, there are many subsets of natural 
language that can be correctly described by very simple means, for example, 
names and titles, addresses, prices, dates, etc. In this section, we examine one 
such case in detail: a grammar for dates. 

For the sake of illustration, let us consider here only one of several common 
date formats, expressions such as 

Tuesday 
July 25              Tuesday, July 25 
July 25, 2000        Tuesday, July 25, 2000 

In the following we assume that a date expression consists of a day of the 
week, a month and a date with or without a year, or a combination of the two. 
Note that this description of the syntax of date expressions presents the same 
problem we encountered in the a | a a @-> "[" ... "]" example in the previous section. 
Long date expressions, such as “Tuesday, July 25, 2000”, contain smaller well- 
formed date expressions, e.g. “July 25”, that should be ignored in the context 
of a larger date. In order to simplify the presentation, we stipulate that date 
expressions are contiguous strings, including the internal spaces and commas. 

To facilitate the specification of the date language we first define some auxil- 
iary terms and then use them to define a language of dates and a parser for the 
language. The complete set of definitions is shown below: 

1To9 = 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 
0To9 = "0" | 1To9 
Day = Monday | Tuesday | ... | Saturday | Sunday 
Month = January | February | ... | November | December 
Date = 1To9 | [1 | 2] 0To9 | 3 ["0" | 1] 
Year = 1To9 (0To9 (0To9 (0To9))) 
AllDates = Day | (Day ", ") Month " " Date (", " Year) 

From these definitions we can compile a small finite-state automaton, 
AllDates, with 13 states and 96 arcs that describes a language of about 30 
million date expressions for the period from January 1, 1 to December 31, 9999. 
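
The figure of about 30 million expressions can be checked by counting the choices in each definition. The short Python computation below follows the definitions above term by term; the constant and function names are chosen here for illustration.

```python
DAYS = 7                                   # Monday ... Sunday
MONTHS = 12                                # January ... December
DATES = 9 + 2 * 10 + 2                     # 1To9 | [1|2] 0To9 | 3 ["0"|1]
YEARS = 9 + 9 * 10 + 9 * 10**2 + 9 * 10**3  # one to four digits, no leading zero


def count_all_dates():
    """Number of strings in AllDates = Day | (Day ", ") Month " " Date (", " Year)."""
    with_month = (1 + DAYS) * MONTHS * DATES * (1 + YEARS)
    return DAYS + with_month
```

The optional weekday contributes a factor of 8 and the optional year a factor of 10000, so the total is 7 + 8 x 12 x 31 x 10000.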

A parser for the language can be compiled from the following simple regular 
expression. 

AllDates @-> "[" ... "]" 

It yields a transducer of 23 states and 321 arcs that marks maximal date expres- 
sions in the manner illustrated by the following text: 

Today is [Tuesday, July 25, 2000] because yesterday was [Monday] 
and it was [July 24] so tomorrow must be [Wednesday, July 26]. 

Because of the left-to-right, longest-match constraints associated with the @-> 
operator, the transducer brackets only the maximal date expressions. 
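
A comparable date marker can be sketched with Python's `re` module. It is an approximation of the transducer, not a reimplementation: `re` is leftmost-first rather than longest-match, so the alternatives are ordered so that the longer reading is tried first. The names `ALL_DATES` and `mark_dates` are assumptions of this sketch.

```python
import re

DAY = r"(?:Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday)"
MONTH = (r"(?:January|February|March|April|May|June|July|August|"
         r"September|October|November|December)")
DATE = r"(?:[12][0-9]|3[01]|[1-9])"        # two-digit dates are tried first
YEAR = r"[1-9][0-9]{0,3}"

# (Day ", ") Month " " Date (", " Year) is tried before a bare Day.
ALL_DATES = re.compile(rf"(?:{DAY}, )?{MONTH} {DATE}(?:, {YEAR})?|{DAY}")


def mark_dates(text):
    """Bracket each maximal date expression found from left to right."""
    return ALL_DATES.sub(lambda m: "[" + m.group(0) + "]", text)
```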

However, this regular-expression grammar is not optimal. The AllDates lan- 
guage includes a large number of syntactically correct but semantically invalid 
date expressions. For example, there is no “April 31, 2000”, “February 29, 1900”, 
or “Sunday, September 29, 1941”. April has only 30 days in any year; unlike year 
2000, year 1900 was not a leap year; and September 29, 1941 fell on a Monday. 

All three types of imperfections can be corrected within the finite- 
state calculus. For each of these three types of invalid dates we can define a 
regular language that excludes such expressions. By intersecting these constraint 
languages with the AllDates language, we can define a language that contains 
only semantically valid dates. Figure 6 illustrates the idea. 

Fig. 6. Refinement by Intersection 



We need three constraints: 

MaxDaysInMonth   Restriction on the distribution of 30 and 31. 
LeapDays         Restriction on February 29. 
WeekDayDates     Restrictions on weekdays and dates. 

In fact, all the constraints can be expressed by means of the restriction operator 
=> defined in the previous section. For example, to build the leap day constraint 
we first need to define the language of leap years, that is the language of all 
numbers divisible by four but subtracting centuries such as 1900 that are not 
divisible by 400. 

Even = "0" | 2 | 4 | 6 | 8 
Odd = 1 | 3 | 5 | 7 | 9 
N = 1To9 0To9* 
Div4 = [((N) Even) ["0" | 4 | 8]] | [(N) Odd [2 | 6]] 
LeapYears = Div4 - [[N - Div4] "0" "0"] 

Here we first define Div4 as the infinite set of natural numbers that are divisible 
by four. This set consists of two parts: numbers that end in 0, 4, or 8 possibly 
preceded by an even number and numbers that end in 2 or 6 preceded by an 
odd number. Finally, we define LeapYears as the set of numbers divisible by 
4 subtracting centuries that are not multiples of 400. Note that the expression 
[N - Div4] "0" "0" denotes numbers with two final zeros that are preceded 
by a number that is not divisible by four. For example, it includes “1900” but 
not “2000”. Because LeapYears is defined as Div4 minus this set, it follows that 
the string “2000” is in the language but “1900” is not. 
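
The Div4 and LeapYears definitions translate almost symbol for symbol into Python regular expressions, which makes it possible to compare them with the arithmetic leap-year rule. The sketch below assumes years written without leading zeros and uses the standard `calendar.isleap` function as the reference; the names `is_div4` and `is_leap_year` are hypothetical.

```python
import calendar
import re

N = r"[1-9][0-9]*"
# [((N) Even) ["0" | 4 | 8]] | [(N) Odd [2 | 6]]
DIV4 = re.compile(rf"(?:(?:{N})?[02468])?[048]|(?:{N})?[13579][26]")
NUMBER = re.compile(N)


def is_div4(digits):
    return DIV4.fullmatch(digits) is not None


def is_leap_year(digits):
    """LeapYears = Div4 - [[N - Div4] "0" "0"], tested on one digit string."""
    if not is_div4(digits):
        return False
    if digits.endswith("00"):
        prefix = digits[:-2]
        if NUMBER.fullmatch(prefix) and not is_div4(prefix):
            return False
    return True


def agrees_with_calendar(first=1, last=9999):
    """Compare the regular-expression definition with calendar.isleap."""
    return all(is_leap_year(str(y)) == calendar.isleap(y)
               for y in range(first, last + 1))
```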

Once the language of leap years is defined, the distribution of “February 29” 
in date expressions can be constrained with the following simple restriction. 

LeapDays = February " " 29 ", " => _ LeapYears .#. 

In other words: a date expression containing “February 29, ” must terminate 
with a leap year. The boundary symbol, .#., is necessary here to mark the end 
of the year string in order to rule out expressions like “February 29, 1969” which 
would qualify if we were allowed to take into account only the first three digits 
since year 196 is a leap year in the Gregorian calendar. 

The construction of the WeekDayDates constraint is not trivial, but it is not as 
difficult as it might initially seem. See m for details. Having constructed the 
auxiliary constraint languages we can define the language of valid dates as 

ValidDates = AllDates & MaxDaysInMonth & LeapDays & WeekDayDates 

The network contains 805 states, 6472 arcs, and about 7 million date expressions. 

We could now construct a parser that recognizes only valid dates. But we ac- 
tually can do something more interesting, namely, define a parser that recognizes 
all date expressions and marks them as valid, “[VD”, or invalid, “[ID”: 

ValidDates @-> "[VD " ... "]" , 

[AllDates - ValidDates] @-> "[ID " ... "]" 

This parallel replacement expression compiles into a 2699 state, 20439 arc trans- 
ducer in about 15 seconds on a Sun workstation. The time includes the com- 
pilation of all the auxiliary expressions and constraints discussed above. The 
following example illustrates the effect of the transducer on a sample text. 

The correct date for today is [VD Tuesday, July 25, 2000]. 

Today is not [ID Tuesday, July 26, 2000]. 
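
The behavior of the valid/invalid date marker can be imitated in Python by combining a regular expression with the standard `datetime` module, which uses the proleptic Gregorian calendar for years 1 to 9999. This sketch validates dates procedurally instead of intersecting constraint languages, and it assumes that expressions without a year are valid whenever the month can have that many days. All names are illustrative.

```python
import datetime
import re

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MAX_DAYS = [31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]

DAY = "(?:" + "|".join(DAYS) + ")"
MONTH = "(?:" + "|".join(MONTHS) + ")"
DATE_EXPR = re.compile(
    rf"(?:(?P<day>{DAY}), )?(?P<month>{MONTH}) "
    r"(?P<date>[12][0-9]|3[01]|[1-9])(?:, (?P<year>[1-9][0-9]{0,3}))?"
    rf"|{DAY}"
)


def is_valid(match):
    if match.group("month") is None:          # a bare weekday name
        return True
    month = MONTHS.index(match.group("month")) + 1
    date = int(match.group("date"))
    if match.group("year") is None:           # assumption: the year is unknown
        return date <= MAX_DAYS[month - 1]
    try:
        day = datetime.date(int(match.group("year")), month, date)
    except ValueError:                        # e.g. April 31 or February 29, 1900
        return False
    weekday = match.group("day")
    return weekday is None or DAYS[day.weekday()] == weekday


def mark(text):
    """Bracket every date expression and tag it as valid (VD) or invalid (ID)."""
    def bracket(match):
        tag = "VD" if is_valid(match) else "ID"
        return "[" + tag + " " + match.group(0) + "]"
    return DATE_EXPR.sub(bracket, text)
```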

6 Conclusion 

Although regular expressions and the algorithms for converting them into finite- 
state automata have been part of elementary computer science for decades, the 
restriction, replacement, and marking expressions we have focused on are relatively 
recent. They have turned out to be very useful for linguistic applications, 
in particular for morphology, tokenization, and shallow parsing. Descriptions 
consisting of regular expressions can be efficiently compiled into finite-state 
networks, which in turn can be determinized, minimized, sequentialized, com- 
pressed, and optimized in other ways to reduce the size of the network or to 
increase the application speed. Many years of engineering effort have produced 
efficient runtime algorithms for applying networks to strings. 

Regular expressions have a clean, declarative semantics. At the same time 
they constitute a kind of high-level programming language for manipulating 
strings, languages, and relations. Although regular grammars can cover only 
limited subsets of a natural language, there can be an important practical ad- 
vantage in describing such sublanguages by means of regular expressions rather 
than by some more powerful formalism. Because regular languages and relations 
can be encoded as finite automata, they can be more easily manipulated than 
context-free and more complex languages. With regular expression operators, 
new regular languages and relations can be derived directly without rewriting 
the grammars for the sets that are being modified. This is a fundamental advantage 
over higher-level formalisms. 

Abstract. G. Myers describes a bit-vector algorithm to com- 
pute the edit distance between strings. The algorithm converts an input 
sequence to an output sequence in a parallel way, using bit operations 
readily available in processors. 

In this paper, we generalize the technique, and characterize a class of au- 
tomata for which there exist equivalent parallel, or vector, algorithms. 
As an application, we extend Myers' result to arbitrary weighted edit dis- 
tances, which are currently used to explore the vast data-bases generated 
by genetic sequencing. 



1 Introduction 

Finite automata are powerful devices for computing on sequences of characters. 
Among the finest examples, very elegant linear algorithms have been developed 
for the string matching problem. Automata are also widely used in fields 
such as metric lexical analysis or bio-computing, where approximate string 
matching is at the core of most algorithms that deal with genetic sequences. 
In these fields, the huge amount of data to be processed - sometimes billions of 
characters - calls for algorithms that are better than linear. 

One way to accelerate the computations is to exploit the parallelism of vector 
operations, especially bit-vector operations. For example, bit-vectors have been 
used to code the set of states of a non-deterministic automaton. In 
this paper, we want to accelerate computations done with deterministic 
automata, and we use vectors to represent sequences of events or sequences of 
states. 

Given a deterministic finite automaton, and an input sequence $x_1 \ldots x_m$, we 
are interested in the output sequence $y_1 \ldots y_m$ of visited states. Since executing 
one transition is usually considered to be a constant time operation, the output 
sequence can be obtained in $O(m)$ time. 

In order to improve the efficiency of such algorithms, we have to find quicker 
ways to obtain $y_1 \ldots y_m$ from $x_1 \ldots x_m$. The best possible algorithm would do it 
in constant time: 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 47-^^ 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001 

$$\begin{array}{c} x_1 x_2 \cdots x_m \\ \downarrow \\ y_1 y_2 \cdots y_m \end{array}$$

Clearly, this can seldom be done when $m$ is unbounded. The next best alter- 
native would be to obtain $y_1 \ldots y_m$ with a bounded number of vector operations 
on $x_1 \ldots x_m$. 

$$\begin{array}{cccc} x_1 & x_2 & \cdots & x_m \\ \downarrow & \downarrow & & \downarrow \\ y_1 & y_2 & \cdots & y_m \end{array}$$

Indeed, vector operations can be implemented in parallel, in dedicated cir- 
cuits, or using high-speed bit-wise operations available in processors. The draw- 
back is that vector operations are applied component by component, meaning 
that the only computations that one could hope to solve with pure vector oper- 
ations are those where the value of $y_i$ depends only on the value of $x_i$, and its 
close neighbors. 

On the bright side, some bit operations widely available in processors do 
have a memory of past events. Using these, it is possible to parallelize complex 
computations done with automata. 



2 The Basics of Vector Algorithms 

As a simple example, consider the following automaton. On input sequence babba, 
it will generate the output $s_1 s_0 s_1 s_2 s_0$; we omit the leading initial state. 

[Figure: automaton with states $s_0$, $s_1$, $s_2$ on the alphabet {a, b}] 




Let’s associate to a string $x_1 \ldots x_m$ the characteristic vector of a letter $l$, 
denoted by the bold letter $\mathbf{l}$, and defined by: 

$$\mathbf{l}_i = \begin{cases} 1 & \text{if } x_i = l \\ 0 & \text{otherwise} \end{cases}$$

Characteristic vectors are sequences of bits, and we will operate on them 
with the standard bit operations: bit-wise logical operators, left and right shifts, 
binary addition, etc. We can obtain, for example, the characteristic vector of a 
set S of letters by computing the disjunction of the characteristic vectors of the 
letters in S. 

In our example, we have the following direct computation of the character- 
istic vectors $\mathbf{S}_0$, $\mathbf{S}_1$ and $\mathbf{S}_2$ of the output sequence $y_1 \ldots y_m$, in terms of the 
characteristic vectors $\mathbf{a}$ and $\mathbf{b}$ of the input sequence. 

$$\mathbf{S}_0 = \mathbf{a} \qquad (1)$$
$$\mathbf{S}_1 = (\uparrow_1 \mathbf{S}_0) \wedge \mathbf{b} \qquad (2)$$
$$\mathbf{S}_2 = \neg(\mathbf{S}_0 \vee \mathbf{S}_1) \qquad (3)$$

where $\uparrow_b \mathbf{l}$ stands for a right shift of the vector $\mathbf{l}$ with the value $b$ filled in at the 
first position. 

These three equations are used in the following way. Suppose, for example, 
that the input sequence given to the automaton is babba. The characteristic 
vectors of the letters a and b are thus: 

a = 01001 
b = 10110 

Equation (1) states that the output is $s_0$ if and only if the input is a, thus: 

$\mathbf{S}_0 = 01001$ 

The output state is $s_1$ when the input letter is b, and the preceding state is 
$s_0$. This can be computed, as in equation (2), by shifting vector $\mathbf{S}_0$ to the right 
and taking the bit-wise conjunction with vector $\mathbf{b}$. 

$\mathbf{S}_1 = 10100 \wedge 10110 = 10100$ 

In all other cases, the output state is $s_2$, which can be expressed, as in 
equation (3), by the bit-wise negation of the disjunction of characteristic vectors 
$\mathbf{S}_0$ and $\mathbf{S}_1$: 

$\mathbf{S}_2 = \neg(01001 \vee 10100) = 00010$ 

If we assume that vector operations are done in parallel then, regardless of 
the length of the input sequence, the characteristic vectors of the output can 
be computed with 4 operations! This example is simple, since the output state 
depends on at most two input letters, but it gives the flavor of the technique. In 
general, the output states will depend on arbitrarily “far” events but, in some 
interesting cases, it will still be possible to reduce the computation to direct 
bit-vector operations. 
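
Equations (1) to (3) can be tried out in a few lines of Python. In this sketch, vectors are plain lists of 0/1 values written in reading order, and the sequential automaton used for comparison is inferred from the equations (a always leads to $s_0$; b leads to $s_1$ from $s_0$ and to $s_2$ otherwise; the initial state is $s_0$), since the transition diagram is not reproduced here. All function names are hypothetical.

```python
def characteristic(text, letter):
    """Characteristic vector of a letter: 1 where the letter occurs."""
    return [1 if ch == letter else 0 for ch in text]


def shift_right(vec, fill):
    """Right shift with `fill` placed in the first position."""
    return [fill] + vec[:-1]


def vector_states(text):
    a, b = characteristic(text, "a"), characteristic(text, "b")
    s0 = a                                                # equation (1)
    s1 = [x & y for x, y in zip(shift_right(s0, 1), b)]   # equation (2)
    s2 = [1 - (x | y) for x, y in zip(s0, s1)]            # equation (3)
    return s0, s1, s2


def decode(s0, s1, s2):
    """Turn the three characteristic vectors back into a list of state indices."""
    return [0 if x else 1 if y else 2 for x, y, _ in zip(s0, s1, s2)]


def simulate(text):
    """Sequential run of the automaton inferred from the equations."""
    state, visited = 0, []
    for ch in text:
        if ch == "a":
            state = 0
        elif state == 0:
            state = 1
        else:
            state = 2
        visited.append(state)
    return visited
```

On real hardware each list operation would be a single machine-word instruction; the lists are used here only to keep the positions readable.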



2.1 Remembering Past Events 

The simplest example of a computation that is influenced by past events is given 
by the following automaton $A_1$. 

In this case, whether input c yields state $s_0$ or $s_1$ depends on events that can 
be arbitrarily far, as shown by the two input sequences $ac^n$ and $bc^n$. Fortunately, 
we have the following formulas that compute $\mathbf{S}_0$ and $\mathbf{S}_1$ using binary addition 
with carry propagation - performed from left to right on the bit-vectors: 

The Addition Lemma: The characteristic vectors of states $s_0$ and $s_1$ of $A_1$ 
are 

$$\mathbf{S}_0 = \mathbf{b} \vee [\mathbf{c} \wedge (\neg\mathbf{b} + \neg(\mathbf{b} \vee \mathbf{c}))]$$
$$\mathbf{S}_1 = \mathbf{a} \vee [\mathbf{c} \wedge \neg(\mathbf{a} + (\mathbf{a} \vee \mathbf{c}))]$$

Proof: As noted in [?], automaton $A_1$ is similar to the classical bit addition
Moore automaton:

[Figure: the bit addition Moore automaton]

In this automaton, state $s_1$ means that the carry bit is set to 1. This state is
reached when both bits are 1, or when the two bits are different, and their sum
is 0. Thus, if two vectors $x_1$ and $x_2$ are added, the characteristic vector of state
$s_1$ is

$(x_1 \wedge x_2) \vee [((\neg x_1 \wedge x_2) \vee (x_1 \wedge \neg x_2)) \wedge \neg(x_1 + x_2)]$.

Using the characteristic vectors associated with events $a$, $b$ and $c$ of the
automaton $A_1$, define $x_1 = a$ and $x_2 = a \vee c$. Then:

$a = x_1 \wedge x_2$
$b = \neg x_1 \wedge \neg x_2$
$c = \neg x_1 \wedge x_2$

With these identities, automaton $A_1$ becomes a sub-graph of the bit addition
automaton. Since $x_1 \wedge \neg x_2$ is always false, we get, by substitution, the formula
$S_1 = a \vee [c \wedge \neg(a + (a \vee c))]$. The formula for $S_0$ is derived in a similar way with
$x_1 = \neg b$ and $x_2 = \neg(b \vee c)$. ■
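The Addition Lemma can be checked mechanically. The sketch below assumes, reading the formulas back into the automaton, that $A_1$ has two states, starts in $s_0$, moves to $s_1$ on $a$, to $s_0$ on $b$, and stays put on $c$. Vectors are Python integers in which bit $i$ stands for position $i$ of the input, so that integer addition propagates carries from left to right along the word; all names are ours.

```python
def addition_lemma(word):
    """Return (S0, S1) as integers; bit i stands for position i of the word."""
    n = len(word)
    mask = (1 << n) - 1
    a = sum(1 << i for i, ch in enumerate(word) if ch == "a")
    b = sum(1 << i for i, ch in enumerate(word) if ch == "b")
    c = sum(1 << i for i, ch in enumerate(word) if ch == "c")

    def neg(x):
        # Bit-wise negation restricted to the n positions of the word.
        return ~x & mask

    s0 = b | (c & (neg(b) + neg(b | c)) & mask)
    s1 = a | (c & neg(a + (a | c)))
    return s0, s1


def simulate(word):
    """Reference: run the assumed automaton A_1 one letter at a time."""
    state, s0, s1 = 0, 0, 0
    for i, ch in enumerate(word):
        if ch == "a":
            state = 1
        elif ch == "b":
            state = 0
        # on c the state is unchanged
        if state == 0:
            s0 |= 1 << i
        else:
            s1 |= 1 << i
    return s0, s1


print(addition_lemma("bca"))  # (3, 4): s0 at positions 0 and 1, s1 at position 2
```

A single addition replaces the whole left-to-right scan, which is the point of the lemma: the carry chain of the adder does the remembering.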



2.2 Solving More Complex Automata 

In this section, we establish a sufficient condition for the existence of a vector
algorithm for an automaton.

Consider a finite complete automaton $A$ on an alphabet of events $\Sigma$, with states
$Q$ and transition function $T : Q \times \Sigma \to Q$. We say that a state $s$ is solvable if, for
all events $x \in \Sigma$ that do not loop on $s$, either all states reach $s$ with $x$, or none
does. Formally:

$\forall s' \in Q,\ T(s', x) = s$, or
$\forall s' \ne s \in Q,\ T(s', x) \ne s$.

A solvable state can be removed from an automaton in the following sense.
Let $E_s$ be the set of events for which $T(s', x) = s$ for all $s'$, and $A \setminus \{s\}$ be the
automaton obtained from $A$ by removing $s$, and all its pending arrows. Then if
$s$ is solvable, $A \setminus \{s\}$ is still a complete automaton on the alphabet $\Sigma \setminus E_s$, since
$T(s', y) \ne s$ if $y$ is not in $E_s$.

Definition 1. An automaton $A$ is solvable if it has one state, or if it has one
solvable state $s$, and $A \setminus \{s\}$ is solvable.
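Definition 1 translates almost literally into a recursive test. The following sketch is one possible reading, not an optimized procedure: the transition function is a dictionary `T[(state, event)]`, the function names are ours, and the search simply tries every solvable state in turn.

```python
def is_solvable_state(states, events, T, s):
    """Check the condition on s: each event sends all states to s, or no other state to s."""
    for x in events:
        all_reach = all(T[q, x] == s for q in states)
        none_reach = all(T[q, x] != s for q in states if q != s)
        if not (all_reach or none_reach):
            return False
    return True


def is_solvable(states, events, T):
    """Definition 1: one state, or a solvable state s whose removal leaves a solvable automaton."""
    states, events = frozenset(states), frozenset(events)
    if len(states) == 1:
        return True
    for s in states:
        if is_solvable_state(states, events, T, s):
            # E_s: the events that send every state to s; they disappear with s.
            e_s = {x for x in events if all(T[q, x] == s for q in states)}
            if is_solvable(states - {s}, events - e_s, T):
                return True
    return False


# Two-state automaton in which a leads to 1, b leads to 0 and c loops.
T1 = {(0, "a"): 1, (1, "a"): 1, (0, "b"): 0, (1, "b"): 0, (0, "c"): 0, (1, "c"): 1}
print(is_solvable({0, 1}, {"a", "b", "c"}, T1))  # True

# Hypothetical two-state automaton whose events are both permutations.
T2 = {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}
print(is_solvable({0, 1}, {"a", "b"}, T2))  # False
```

The second automaton is our own illustration of an automaton without any constant event; it is not taken from the text.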



Theorem 1. If an automaton A is solvable, then there exists a vector algorithm 
for A. 

Sketch of Proof. When a state $s$ is solvable, the Addition Lemma can be used
to compute the characteristic vector of $s$. Indeed, let $E_s$ be the characteristic
vector of the set $E_s$, and $L_s$, the characteristic vector of the events that loop on
$s$ but are not in $E_s$. Then:

$S = E_s \vee [L_s \wedge (\neg E_s + \neg(E_s \vee L_s))]$ or
$S = E_s \vee [L_s \wedge \neg(E_s + (E_s \vee L_s))]$    (4)

according to whether $s$ is initial or not.

Suppose that we have computed the characteristic vectors of a subset D of 
solvable states, and let K be the disjunction of all those known characteristic 
vectors. 

If $s$ is a solvable state in $A \setminus D$, we will show that computing the characteristic
vector of $s$ is essentially a local decision, and can be done in a vectorial way.

The value of $S_i$ will be equal to 1 in three circumstances. First, if the preceding
state is known, that is, when $K_{i-1} = 1$, then the value of $S_i = 1$ can be decided
with the transition table of $A$. If the preceding state is unknown, then it belongs
to $A \setminus D$, and $S_i = 1$ if the input $x_i$ is in $E_s$. Let $N$ be the characteristic vector
resulting from these two possibilities.

Vector $N$ covers, at least, all the cases when $s$ is reached from a different
state. In order to account for looping events in $L_s$, we apply Equation (4) with
$E_s = N$. □

Theorem 1 proves the existence of a vector algorithm, but does not give an 
efficient way to construct one. In the following sections, we will study in detail
a nontrivial application.



3 Approximate String Matching 

A common way to formalize the notion of distance between two strings is the edit 
distance, based on the number of operations required to transform one string into 
another. Three basic operations are permitted on individual characters: insertion, 
deletion and replacement. 

For example, in order to transform the string COMPUTER into the string 
SOLUTION, we can apply to the first string the sequence of edit operations 
RMDRMMIRR, where R denotes a replacement, D a deletion, I an insertion 
and M a match. 

Such a transformation is usually displayed as an alignment of the two strings, 
where matched or replaced letters are on top of each other, and insertions and 
deletions are denoted by properly placed dashes. With our example, we get
the alignment: 



COMPUT-ER 

SO-LUTION 

The edit distance between two strings is defined as the minimum number 
of edit operations - excluding matches - needed to transform one string into 
another. 

A crucial generalization of the edit distance for applications in biology is the 
weighted edit distance. It comes from the observation - in biological sequences 
- that replacements are not equally likely. Assigning different costs to different 
edit operations allows the construction of alignments that are meaningful from 
an evolutionary point of view. 

Let $c$ be the cost associated with an insertion or a deletion, and $\delta(a, b)$ be the
cost of replacing $a$ by $b$. We define the cost of a sequence of edit operations
to be the sum of the costs of the operations involved. Since a replacement can
be achieved by a deletion followed by an insertion, the replacement cost $\delta(a, b)$
should be less than $2c$.

Definition 2. The weighted edit distance $\delta(A, B)$ between two strings $A$ and $B$
is the minimal cost to transform $A$ into $B$.

In the sequel, we will focus on the following problem. Given a query sequence
$P = p_1 \ldots p_m$, and a text $T = t_1 \ldots t_n$, we want to find the approximate
occurrences of $P$ in $T$. Formally, the problem is to find all positions $j$ in $T$ such that,
for a given threshold $t \ge 0$, we have $\min_g \delta(P, T[g, j]) \le t$. Typically, $P$ will be
relatively short (a few hundred characters), while $T$ can be quite large.

The classic solution [?] is obtained by computing the matrix $C[0..m, 0..n]$
with the recurrence relation:

$C[i,j] = \min \{\, C[i-1,j-1] + \delta(p_i, t_j),\ C[i,j-1] + c,\ C[i-1,j] + c \,\}$    (5)

and initial conditions $C[0,j] = 0$ and $C[i,0] = ic$.

The successive values of $C[m,j]$ give the desired distances and can be compared
to the threshold $t$. For example, suppose one wants to compute the approximate
occurrences of TATA in the text ACGTAATAGC... with the usual
edit distance, that is $c = 1$ and $\delta(a, b) = 1$ if $a \ne b$. The computation is done
with the help of a grid whose cells hold the values of $C[i,j]$. The following table
gives a snapshot of the evolving computation:

          A  C  G  T  A  A  T  A  G  C  ...
       0  0  0  0  0  0  0  0  0  0  0  ...
    T  1  1  1  1  0  1  1  0  1  1  1  ...
    A  2  1  2  2  1  0  1  1  0  1
    T  3  2  2  3  2  1  1  1  1  1
    A  4  3  3  3  3  2  1  2  1  2
                         ↑     ↑

In this table, we can see that, on two occasions, the query string is at one
unit of distance from substrings in the text: the substring TAA, at position 6,
and the substrings ATA, AATA, and TAATA, at position 8.
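The table can be reproduced with a direct implementation of recurrence (5). The sketch below keeps a single column at a time; the function names, the default unit-cost $\delta$ and the convention that a position is reported when its score is at most the threshold are our choices.

```python
def match_scores(pattern, text, c=1, delta=None):
    """Return [C[m,0], ..., C[m,n]] using recurrence (5), one column at a time."""
    if delta is None:
        delta = lambda a, b: 0 if a == b else 1   # usual edit distance
    m = len(pattern)
    column = [i * c for i in range(m + 1)]        # C[i,0] = i*c
    scores = [column[m]]
    for ch in text:
        new = [0] * (m + 1)                       # C[0,j] = 0
        for i in range(1, m + 1):
            new[i] = min(column[i - 1] + delta(pattern[i - 1], ch),  # replacement or match
                         column[i] + c,                              # C[i,j-1] + c
                         new[i - 1] + c)                             # C[i-1,j] + c
        column = new
        scores.append(column[m])
    return scores


def occurrences(pattern, text, t, c=1):
    """Positions j >= 1 of the text where the last-row score is at most t."""
    scores = match_scores(pattern, text, c)
    return [j for j, s in enumerate(scores) if j > 0 and s <= t]


print(match_scores("TATA", "ACGTAATAGC"))    # [4, 3, 3, 3, 3, 2, 1, 2, 1, 2, 2]
print(occurrences("TATA", "ACGTAATAGC", 1))  # [6, 8]
```

Only the last entry of each column is kept as output, which is all the matching problem needs.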

The whole computation can be carried out column by column, requiring
$O(nm)$ time and $O(m)$ space. In order to do better, we first state a useful
lemma that bounds the absolute value of differences between horizontal and
vertical values in the matrix:

Lemma 1. $|C[i,j] - C[i-1,j]| \le c$ and $|C[i,j] - C[i,j-1]| \le c$.

Since the horizontal and vertical differences are bounded, we can code this 
computation as an automaton, which will turn out to be solvable. 



3.1 Computing Distances with an Automaton 

With the notations of Equation (5), define:

$\Delta v_{i,j} = c - (C[i,j] - C[i-1,j])$
$\Delta h_{i,j} = C[i,j] - C[i,j-1] + c$

From Lemma 1, $\Delta v_{i,j}$ and $\Delta h_{i,j}$ are in the interval $[0..2c]$, and if the successive
values of $\Delta h_{m,j}$ are known, then the value of the score, $C[m,j]$, can be computed
by the recurrence $C[m,j] = C[m,j-1] + \Delta h_{m,j} - c$, and initial condition
$C[m,0] = mc$.

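With elementary arithmetic manipulations, Equation (5) translates as:

$\Delta h_{i,j} = \min \{\, \Delta v_{i,j-1} + \delta(p_i, t_j),\ \Delta v_{i,j-1} + \Delta h_{i-1,j},\ 2c \,\}$    (6)

with initial conditions $\Delta h_{0,j} = c$ and $\Delta v_{i,0} = 0$. We can thus define an automaton
$B$ that will compute the sequence $\Delta h_j = \Delta h_{1,j} \ldots \Delta h_{m,j}$ given the sequence
of pairs:

$(\Delta v_{j-1}, \delta(p, t_j)) = (\Delta v_{1,j-1}, \delta(p_1, t_j)) \ldots (\Delta v_{m,j-1}, \delta(p_m, t_j))$.

The states of $B$ are $\{0, \ldots, 2c\}$, with initial state $c$, and the transition function
of $B$ is given by the following diagram, for an event $(\Delta v, \delta)$ in the cartesian product
$[0..2c] \times [0..2c-1]$.

[Diagram: transition from state $k'$ to state $k$ on event $(\Delta v, \delta)$]

where $k = \min \{\, \Delta v + \delta,\ \Delta v + k',\ 2c \,\}$.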



Theorem 2. Automaton B is solvable. 

Sketch of proof. We will show that, for $k \in [0..2c)$, $k$ is solvable in $B_{k-1} =
B \setminus \{0, \ldots, k-1\}$, with the set of events $(\Delta v, \delta)$ such that $\Delta v + \delta = k$.

First note that the only remaining events in $B_{k-1}$ are those that satisfy
$\Delta v + \delta \ge k$. If $\Delta v + \delta = k$, then $\min(\Delta v + \delta, \Delta v + k', 2c) = k$, since $k' \ge k$ for
states in $B_{k-1}$. If both $\Delta v + \delta > k$ and $k' > k$, then the minimum is certainly
greater than $k$. □

Theorem 1 can then be used to produce a corresponding vector algorithm.
Note that, in this case, looping events are easily detected, since if $k = k' < 2c$,
then either $\Delta v = 0$ or $\Delta v + \delta = k$.
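Before vectorizing, it may help to see automaton $B$ run sequentially. The sketch below feeds one column of events to $B$, reads the states as $\Delta h$ values, and updates $\Delta v$ with the identity $\Delta v_{i,j} = \Delta h_{i-1,j} + \Delta v_{i,j-1} - \Delta h_{i,j}$, which follows directly from the two definitions. Function names and the default unit-cost $\delta$ are ours, and a non-empty pattern is assumed.

```python
def unit_cost(a, b):
    """Replacement cost of the usual edit distance."""
    return 0 if a == b else 1


def run_b(pattern, ch, dv_prev, c=1, delta=unit_cost):
    """Feed the events (dv_prev[i], delta(p_i, ch)) of one column to automaton B."""
    k = c                                    # initial state of B: dh_{0,j} = c
    dh, dv = [], []
    for p, v in zip(pattern, dv_prev):
        k_next = min(v + delta(p, ch), v + k, 2 * c)   # transition of B
        dv.append(k + v - k_next)            # dv_{i,j} = dh_{i-1,j} + dv_{i,j-1} - dh_{i,j}
        dh.append(k_next)
        k = k_next
    return dh, dv


def scores_with_b(pattern, text, c=1, delta=unit_cost):
    """Last-row scores C[m,0..n], recovered from the states of B only."""
    m = len(pattern)
    dv = [0] * m                             # dv_{i,0} = 0
    score = m * c                            # C[m,0] = m*c
    out = [score]
    for ch in text:
        dh, dv = run_b(pattern, ch, dv, c, delta)
        score += dh[-1] - c                  # C[m,j] = C[m,j-1] + dh_{m,j} - c
        out.append(score)
    return out


print(scores_with_b("TATA", "ACGTAATAGC"))  # [4, 3, 3, 3, 3, 2, 1, 2, 1, 2, 2]
```

The inner loop is the part that a vector algorithm replaces by a bounded number of whole-vector operations per state.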

In order to complete the presentation of a vector algorithm for the computation
of $\Delta v_j$, we use the relation $\Delta v_{i,j} = \Delta h_{i-1,j} + \Delta v_{i,j-1} - \Delta h_{i,j}$, which leads
to the vector equation:

$\Delta v_j = \uparrow_c \Delta h_j + \Delta v_{j-1} - \Delta h_j$.



4 A Bit-Vector Algorithm

This section contains brief notes on how to implement a bit-vector algorithm for 
the approximate string matching problem. 

The first problem is to represent vectors of integers that are not necessarily
0 or 1, but in the bounded interval $[0..2c]$. This can be done easily with an
$l \times m$ bit-matrix, where $l = \log(2c) + 1$. There are well known algorithms for
all the basic operations (assignment, comparison and arithmetic) on these
bit-matrices. These operations are identified with arrows, and bold constants stand
for constant vectors.

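Assuming that $\Delta v$ is known, the next letter of the text is read, and $\delta(p, t_j)$
is looked up in a pre-computed table. Three vectors, corresponding respectively
to the three cases of the proof of Theorem 1, are initialized in the following way:

$\mathit{Sum}_1 \leftarrow \Delta v$    % $\mathit{Sum}_1$ will hold the values of $\Delta h_{i-1} + \Delta v_i$.
$\mathit{Sum}_2 \leftarrow \Delta v + \delta$    % $\mathit{Sum}_2$ is used to test if $(\Delta v, \delta) \in E_k$.
$\mathit{Loop} \leftarrow (\Delta v = \mathbf{0})$    % $\mathit{Loop}$ contains looping events not in $E_k$.

The characteristic vector $N$ of state $k$, from 0 to $2c - 1$, is computed with
the following equations, keeping track of the known states $K$.

$N \leftarrow (\uparrow_b K \wedge (\mathit{Sum}_1 = \mathbf{k})) \vee (\mathit{Sum}_2 = \mathbf{k})$    % $b = (k \ge c)$.
$N \leftarrow N \vee [\mathit{Loop} \wedge \neg(N + (N \vee \mathit{Loop}))]$
$N \leftarrow N \wedge \neg K$
$K \leftarrow K \vee N$
$\mathit{Sum}_1 \leftarrow \mathit{Sum}_1 + \neg(\uparrow_b K)$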

The values of the vector $\Delta h$ can then be set to $k$ using the mask $N$. According
to Theorem 1, when $k = c$, the initial state, the computation of $N$ should
use an alternative formula. But, in this case, the first bit of $N$ is properly set
by the first instruction since $\mathit{Sum}_1$ contains the value $c$ if $\Delta v_1 = 0$.

Finally the algorithm computes the characteristic vector of state 2c, the score, 
and the new value of Av. 



5 Further Developments 



As a first remark, we want to underline the fact that testing solvability for an 
automaton is a simple procedure that can be easily automated. If an automaton 
is solvable, it should also be possible to obtain automatically a corresponding 
vector algorithm, though not necessarily optimal. Indeed, in order to produce 
an efficient algorithm, Section 4 relied on the fact that the transition table of 
the automaton had many “arithmetic” symmetries. Is it possible to optimize the 
general algorithm? 

Another interesting avenue is to broaden the class of automata that can be
parallelized. Clearly, not all automata are solvable, the simplest counter-example
being:

[Figure: counter-example automaton with events $a$ and $b$]

In this case, if one looks at an event $x$ as a function $x : Q \to Q$, then
both $a$ and $b$ are permutations. For an automaton to be solvable, it must have at
least one constant event. One way to generalize the notion of decidability would
be to extend it to constant composition of events, hinting at the possibility that
properties of the syntactic monoid (see [?] and [?]) are related to the existence
of vector algorithms.

Abstract. We establish that regularly extended two-way nondeterministic
tree automata with unranked alphabets have the same expressive
power as regularly extended nondeterministic tree automata with un- 
ranked alphabets. We obtain this result by establishing regularly ex- 
tended versions of a congruence on trees and of a congruence on, so 
called, views. Our motivation for the study of these tree models is the 
Extensible Markup Language (XML), a metalanguage for defining docu- 
ment grammars. Such grammars have regular sets of right-hand sides for 
their productions and tree automata provide an alternative and useful 
modeling tool for them. In particular, we believe that they provide a 
useful computational model for what we call caterpillar expressions. 



1 Introduction 

We became interested in regularly extended two-way tree automata (tree automata
that have a regular set of transitions instead of a finite set and, thus,
unbounded degree nodes) because of our work [?] in which we show that tree
languages recognized by caterpillar expressions are tree regular. Initially, we
planned to prove this result by using regularly extended two-way tree automata
to emulate caterpillar expressions and then applying the main theorem of this
paper; namely, we generalize to regularly extended tree automata Moriya’s
result [?], which demonstrates that finite two-way tree automata have the same
expressive power as finite bottom-up tree automata. Our proof of this result is,
however, very different from Moriya’s. We first establish an algebraic
characterization of the languages of regularly extended bottom-up tree automata
and then show that the languages of regularly extended two-way tree automata
satisfy the characterization. Unfortunately, we were unable to design a generic
emulation of caterpillar expressions with regularly extended two-way tree
automata. Therefore, we ended up using the algebraic characterization to prove
that caterpillar expressions recognize tree regular languages.

Regularly extended two-way tree automata are also of interest in their own
right since they provide greater programming flexibility than do regularly extended
one-way tree automata in much the same way that two-way finite-state
automata do when compared to one-way finite-state automata. This choice is
motivated by the Standard Generalized Markup Language (SGML) [?] and the
Extensible Markup Language (XML) [2], which are metalanguages for document
grammars that rely on these requirements. Although most work on classes of documents
is grammatical in nature, grammars are not always the most appropriate
tool for modeling applications. Murata [?] has argued that regularly extended
tree automata often provide a more appropriate framework for investigating tree
transformations, tree query languages, layout generation for trees, and context
specification and evaluation.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 57 ff., 2001.

The research on tree automata and regular languages of trees can be divided
into two categories: one dealing with ranked and the other with unranked alphabets.
The bulk of the literature deals with finite, ranked alphabets. Gécseg and
Steinby [?] have written a comprehensive book on tree automata and tree transducers
over ranked alphabets (an updated survey by the same authors appeared
recently [?]); see also the text of Comon and his collaborators [?]. Although
ranked and unranked alphabets are both finite, the transition relations of the 
corresponding tree automata for ranked alphabets can only be finite whereas the 
transition relations of the corresponding tree automata for unranked alphabets 
need not be finite. We consider the transition relation to be either regular or 
finite in the unranked case. We write finite tree automaton to mean that 
the tree automaton has a finite transition relation and we write (regularly) 
extended tree automaton to mean that the tree automaton has a regular 
transition relation. 

Tree automata for unranked alphabets appear to have been first developed
by Thatcher [?]. He states a number of results on finite tree automata
that carry over directly from the theory of string automata. In particular, he
developed the basic theory of finite tree automata and also introduced and
investigated extended tree automata.

Other researchers studied various aspects of finite and extended tree automata;
see the work of Barrero [?], Moriya [?], Murata [?] and Takahashi [?].

This paper has four further sections. In Section 2 we introduce the basic
notation and terminology for extended tree automata, in Section 3 we introduce
the notion of a top congruence and of views and in Section 4 we use these
notions to prove that extended two-way tree automata are only as expressive as
extended bottom-up tree automata. Last, in Section 5 we state some conclusions
and provide some research problems.



2 Notation and Definitions 

We first recall tree and tree automata concepts before introducing the new con- 
cepts that we need. 

Definition 1. Trees have at least one node; their node labels are taken from a
finite alphabet $\Sigma$. We represent trees by expressions that use the symbols in $\Sigma$
as operators.
Operators have no rank, so they can have any number of operands, including
none. For example, the term a(a(a()a())a(a()a())) represents a complete binary
tree of height two, whose nodes all have the label a. Observe that external nodes
or leaves correspond exactly to those subterms of the form a().

We denote symbols in $\Sigma$ with $a$, strings over $\Sigma$ with $w$ and sets of strings
over $\Sigma$ (we call them string languages) with $L$. The Greek letter $\lambda$ denotes
the empty string. We denote trees with $t$ and sets of trees (we call them tree
languages) with $T$. Subscripted and superscripted variables have the same types
as their base names.

Definition 2. We define the set $\mathit{nodes}(t)$ of nodes of a tree $t$ as a set of
strings of natural numbers. Its definition is by induction on $t$:

For a tree $a(t_1 \cdots t_n)$, $n \ge 0$, we define

$\mathit{nodes}(a(t_1 \cdots t_n)) = \bigcup_{1 \le i \le n} i \cdot \mathit{nodes}(t_i) \cup \{\lambda\}$.

The nodes of a tree viewed as terms correspond to subterms. We denote nodes
of trees with $v$.

Definition 3. The root node $\mathit{root}(t)$ of a tree $t$ is defined as $\lambda$. For each node $v$
of $t$ we define the set $\mathit{children}(v)$ of $v$’s children as the set of all nodes $v \cdot i$
in $\mathit{nodes}(t)$.

Definition 4. A node $v$ of a tree $t$ is a leaf if and only if $\mathit{children}(v) = \emptyset$. The
set of leaves of $t$ is denoted by $\mathit{leaves}(t)$.

Definition 5. For each node $v$ of a tree $t$, we denote the label of $v$ in $\Sigma$
by $\mathit{label}(v)$. More precisely, for a tree $t = a(t_1 \cdots t_n)$, $n \ge 0$, we define:

1. The label of the root node $\lambda$ in $t$ is $a$.

2. The label of the node $i \cdot s$ in $t$ is the label of the node $s$ in $t_i$.
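Definitions 2 to 5 can be made concrete with a small sketch. The representation is our assumption: a tree is a pair `(label, list_of_subtrees)` and a node is a tuple of positive integers, the empty tuple playing the role of $\lambda$.

```python
def nodes(t):
    """Definition 2: the set of node addresses of t, as tuples of positive integers."""
    _, kids = t
    result = {()}                                  # the root, written lambda in the text
    for i, kid in enumerate(kids, start=1):
        result |= {(i,) + v for v in nodes(kid)}   # i . nodes(t_i)
    return result


def children(t, v):
    """Definition 3: all nodes of the form v . i."""
    return {w for w in nodes(t) if len(w) == len(v) + 1 and w[:len(v)] == v}


def leaves(t):
    """Definition 4: the nodes without children."""
    return {v for v in nodes(t) if not children(t, v)}


def label(t, v):
    """Definition 5: the label of node v, found by descending along v."""
    lab, kids = t
    if v == ():
        return lab
    return label(kids[v[0] - 1], v[1:])


# The term a(a(a()a())a(a()a())): a complete binary tree of height two.
leaf = ("a", [])
binary = ("a", [("a", [leaf, leaf]), ("a", [leaf, leaf])])
print(sorted(nodes(binary)))   # [(), (1,), (1, 1), (1, 2), (2,), (2, 1), (2, 2)]
print(sorted(leaves(binary)))  # [(1, 1), (1, 2), (2, 1), (2, 2)]
```

Recomputing `nodes(t)` inside `children` is wasteful but keeps each function a direct transcription of its definition.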

We are now in a position to define the class of tree automata that we inves- 
tigate. 

Definition 6. A (regularly) extended two-way (nondeterministic) tree
automaton $M$ is specified by a triple $(Q, \delta, F)$, where $Q$ is a finite set of states,
$F \subseteq Q$ is a set of final or accepting states, and $\delta \subseteq \Sigma \times Q^* \times Q \times \{u, d, s\}$ is a
transition relation that satisfies the condition that, for all $a$ in $\Sigma$, $q$ in $Q$ and $m$
in $\{u, d, s\}$, the set $\{w \in Q^* \mid (a, w, q, m) \in \delta\}$ is a regular set of strings over the
alphabet $Q$.

If, for all $a$ in $\Sigma$, $q$ in $Q$ and $m$ in $\{u, d, s\}$, the set $\{w \in Q^* \mid (a, w, q, m) \in \delta\}$
is a finite set of strings over the alphabet $Q$, then $M$ is a finite two-way tree
automaton.
Finite two-way tree automata have been investigated by Moriya [?], whereas
our results are on regularly extended two-way tree automata.

We define the computations of a two-way tree automaton on a tree by se- 
quences of configurations. A configuration assigns a state of the automaton to 
each node in a cut of the tree. 

Definition 7. A cut $C$ of a tree $t$ is a subset of $\mathit{nodes}(t)$ such that, for each
leaf node $v$ of $t$, there is exactly one node in $C$ on the path from the root to $v$;
in other words, there is exactly one node in $C$ given by a prefix of $v$.

Definition 8. A configuration $c$ of a two-way tree automaton $M = (Q, \delta, F)$
operating on a tree $t$ is a map $c : C \to Q$ from a cut $C$ of $t$ to
the set of states $Q$ of $M$.

Let $v$ be a node of a tree $t$ and let $c : C \to Q$ be a configuration of
the two-way tree automaton $M$ operating on $t$. If $\mathit{children}(v) \subseteq C$, then formally
$c(\mathit{children}(v))$ is a subset of $Q$. We overload this notation such that
$c(\mathit{children}(v))$ also denotes the sequence of states in $Q$ which arises from the
order of $v$'s children in $t$.

Definition 9. 1. A starting configuration of a two-way tree automaton
$M = (Q, \delta, F)$ operating on a tree $t$ is a configuration $c : \mathit{leaves}(t) \to Q$
such that $c(v)$ is any state $q$ in $Q$ such that $(\mathit{label}(v), \lambda, c(v), u) \in \delta$.

2. A halting configuration is a configuration $c : C \to Q$ such that $C =
\{\mathit{root}(t)\}$.

3. An accepting configuration is a configuration $c : C \to Q$ such that
$C = \{\mathit{root}(t)\}$ and $c(\mathit{root}(t)) \in F$.

Definition 10. 1. A two-way tree automaton $M = (Q, \delta, F)$ operating on a
tree $t$ makes a transition from a configuration $c_1 : C_1 \to Q$ to a configuration
$c_2 : C_2 \to Q$ (symbolically $c_1 \to c_2$) if and only if it makes an up
transition, a down transition or a no-move transition, each of which we now
define.

2. $M$ makes an up transition from $c_1$ to $c_2$ if and only if $t$ has a node $v$ such
that the following four conditions hold:

a) $\mathit{children}(v) \subseteq C_1$.

b) $C_2 = (C_1 \setminus \mathit{children}(v)) \cup \{v\}$.

c) $(\mathit{label}(v), c_1(\mathit{children}(v)), c_2(v), u) \in \delta$.

d) $c_1$ is identical to $c_2$ on their domains’ common subset $C_1 \cap C_2$.

3. $M$ makes a down transition from $c_1$ to $c_2$ if and only if $t$ has a node $v$
such that the following four conditions hold:

a) $v \in C_1$.

b) $C_2 = (C_1 \setminus \{v\}) \cup \mathit{children}(v)$.

c) $(\mathit{label}(v), c_2(\mathit{children}(v)), c_1(v), d) \in \delta$.

d) $c_1$ is identical to $c_2$ on their domains’ common subset $C_1 \cap C_2$.

4. $M$ makes a no-move transition from $c_1$ to $c_2$ if and only if $t$ has a node $v$
such that the following four conditions hold:

a) $v \in C_1$.

b) $C_2 = C_1$.

c) $(\mathit{label}(v), c_1(v), c_2(v), s) \in \delta$.

d) $c_1$ is identical to $c_2$ on $C_1 \setminus \{v\}$, which is equal to $C_2 \setminus \{v\}$.

Definition 11. 1. A computation of a two-way tree automaton $M$ on a
tree $t$ from configuration $c$ to configuration $c'$ is a sequence of configurations
$c_1, \ldots, c_n$, $n \ge 1$, such that $c = c_1 \to \cdots \to c_n = c'$.

2. An accepting computation of $M$ on $t$ is a computation from a starting
configuration to an accepting configuration.

Definition 12. 1. A tree $t$ is recognized by a two-way tree automaton $M$ if
and only if there is an accepting computation of $M$ on $t$.

2. The tree language $T(M)$ of a two-way tree automaton $M$ is the set of trees
that are recognized by $M$.

Definition 13. A (regularly) extended (nondeterministic) bottom-up
tree automaton is an extended (nondeterministic) two-way tree automaton
$M = (Q, \delta, F)$ such that $\delta$ contains only transitions whose last component is $u$.
For a bottom-up tree automaton $M$, we consider $\delta$ to be a subset of $\Sigma \times Q^* \times Q$ by
dropping the fourth, constant component in the transition relation of a two-way
tree automaton.

Note that nondeterministic bottom-up tree automata are only as expressive as
deterministic bottom-up tree automata [?].
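A regularly extended bottom-up tree automaton can be simulated naively, as the following sketch illustrates. Everything about the encoding is our assumption: trees are pairs `(label, list_of_subtrees)`, states are single characters, and the regular set $\{w \mid (a, w, q) \in \delta\}$ is given as a Python regular expression keyed by `(a, q)`. The enumeration of child-state strings is exponential and only meant for small examples.

```python
import re
from itertools import product


def reachable_states(tree, delta):
    """States that some bottom-up computation can assign to the root of `tree`."""
    label, kids = tree
    kid_states = [reachable_states(k, delta) for k in kids]
    found = set()
    for (a, q), pattern in delta.items():
        if a != label or q in found:
            continue
        # Try every string of child states; a leaf contributes the empty string.
        if any(re.fullmatch(pattern, "".join(w)) for w in product(*kid_states)):
            found.add(q)
    return found


def accepts(tree, delta, final):
    """True if some computation ends in a final state at the root."""
    return bool(reachable_states(tree, delta) & set(final))


# Hypothetical automaton: trees c(...) whose children are the leaf a()
# followed by any number of leaves b().
delta = {("a", "A"): "", ("b", "B"): "", ("c", "C"): "AB*"}
a, b = ("a", []), ("b", [])
print(accepts(("c", [a, b, b]), delta, {"C"}))  # True
print(accepts(("c", [b, a]), delta, {"C"}))     # False
```

Because the children of a node are matched against a regular expression rather than a fixed-length tuple, the same transition handles nodes of any degree, which is what distinguishes the extended model from the finite one.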

Definition 14. A tree language is regular if and only if it is the language of 
an extended bottom-up tree automaton. 

Clearly, since every extended bottom-up tree automaton is an extended two-way
tree automaton, every regular tree language is recognized by some extended two-way
tree automaton. Our goal is to prove that the converse also holds; namely,
every tree language recognized by an extended two-way tree automaton is regu- 
lar. We establish this result indirectly by developing an algebraic characterization 
of regular tree languages and then proving that the tree languages recognized 
by extended two-way tree automata satisfy this characterization. 

3 Top Congruences 

Definition 15. A pointed tree (also called a tree with a handle or a handled
tree) is a tree over an extended alphabet $\Sigma \cup \{X\}$ such that precisely one node
is labeled with the variable $X$ and that node is a leaf.

Definition 16. If $t$ is a pointed tree and $t'$ is a (pointed or nonpointed) tree, we
can catenate $t$ and $t'$ by replacing the node labeled $X$ in $t$ with the root of $t'$.
The result is the (pointed or nonpointed) tree $tt'$.

Definition 17. Let $T$ be a tree language. Trees $t_1$ and $t_2$ are top congruent
with respect to $T$ ($t_1 \sim_T t_2$) if and only if, for each pointed tree $t$, the following
condition holds:

$tt_1 \in T$ if and only if $tt_2 \in T$.

The top congruence for trees is the tree analog of the left congruence for 
strings. 

Lemma 1. The top congruence is an equivalence relation on trees; it is a con- 
gruence with respect to catenations of pointed trees with nonpointed trees. 



Definition 18. The top index of a tree language $T$ is the number of
$\sim_T$-equivalence classes.



Lemma 2. Each regular tree language has finite top index. 

A string language is regular if and only if it has finite index; however, that a 
tree language has finite top index is insufficient for it to be regular. For example, 
consider the tree language 

$L = \{a(b^i c^i) : i \ge 1\}$.

Clearly, L has finite top index, but it is not regular. A second condition, regularity 
of local views, must also be satisfied. 

Definition 19. Let $T$ be a tree language, $a$ be a symbol in $\Sigma$, $t$ be a pointed tree
and $T_f$ be a finite set of trees. Then, the local view of $T$ with respect to $t$,
$a$ and $T_f$ is the string language

$V_{t,a,T_f}(T) = \{t_1 \cdots t_n \in T_f^* \mid ta(t_1 \cdots t_n) \in T\}$

over the alphabet $T_f$. For the purposes of local views we treat the trees in the
finite set $T_f$ as symbols in the alphabet $T_f$; the trees in $T_f$ are primitive entities
that can be catenated to give strings over $T_f$. Note that we are not catenating
trees.



Lemma 3. All local views of each regular tree language are regular string
languages.

Example 1. Let

$T = \{c(t_1 \cdots t_n) \mid \mathit{label}(\mathit{root}(t_1)) \cdots \mathit{label}(\mathit{root}(t_n)) \in \{a^l b^l \mid l \ge 1\}\}$.

The tree language $T$ has top index four. Two of its equivalence classes are the
sets of trees whose root labels are $a$ or $b$; the other two are $T$ and the set of trees
that are not in $T$, but have the root label $c$. The local view of $T$ with respect
to the pointed tree $X()$, symbol $c$ and the finite set of trees $\{a(), b()\}$ is the
non-regular set of strings $\{a()^l\, b()^l \mid l \ge 1\}$. Hence, $T$ has finite top index but it is
not regular.

Theorem 1. A tree language is regular if and only if it has finite top index and
all its local views are regular string languages.

At first glance it may appear that the local-view condition for regular tree
languages is a condition on an infinite number of trees. But, if we exchange a
tree $t_1$ in a finite set $T_f$ by an equivalent (with respect to top congruence)
tree $t_2$, then $V_{t,a,(T_f \setminus \{t_1\}) \cup \{t_2\}}(T)$ is the homomorphic image of $V_{t,a,T_f}(T)$ under
a string isomorphism. Hence, if $T$ has finite top index, we need to check the
local-view condition for only a finite number of tree sets $T_f$.

4 Regularly Extended Two-Way Tree Automata 
Languages 

Lemma 4. The language of every extended two-way tree automaton has finite 
top index. 

Lemma 5. The languages of all extended two-way tree automata have only reg- 
ular local views. 

Proof. Let $t$ be a pointed tree, $a$ be a symbol in $\Sigma$, and $T_f$ be a finite set of
trees. We demonstrate that the local view $V_{t,a,T_f}(T)$ of $T$ with respect to $t$, $a$,
and $T_f$, namely the string language

$\{t_1 \cdots t_n \in T_f^* \mid ta(t_1 \cdots t_n) \in T\}$,

is regular.

The proof is in three steps.

The first step is to recognize that $V_{t,a,T_f}(T)$ is a finite union of finite intersections
of the following sets $X_p$ and $X_{pq}$, $p, q \in Q$:

$X_p = \{t_1 \cdots t_n \in T_f^* \mid$
    $c_1 \to^* c_2$,
    $c_1$ is a starting configuration of $M$ on $a(t_1 \cdots t_n)$,
    $c_2$ is a halting configuration of $M$ on $a(t_1 \cdots t_n)$,
    $c_2(\mathit{root}(a(t_1 \cdots t_n))) = p$ and
    there is no other halting configuration in the computation $c_1 \to^* c_2 \}$

and

$X_{pq} = \{t_1 \cdots t_n \in T_f^* \mid$
    $c_1 \to^* c_2$,
    $c_1$ and $c_2$ are halting configurations of $M$ on $a(t_1 \cdots t_n)$,
    $c_1(\mathit{root}(a(t_1 \cdots t_n))) = p$,
    $c_2(\mathit{root}(a(t_1 \cdots t_n))) = q$ and
    there is no other halting configuration in the computation $c_1 \to^* c_2 \}$.

Any computation on $ta(t_1 \cdots t_n)$ from a starting configuration to a halting
configuration can be partitioned into those parts that concern only $t$ and those
parts that concern only $a(t_1 \cdots t_n)$. The parts that concern only $a(t_1 \cdots t_n)$ form
a computation from a starting configuration to a halting configuration, followed
by a number of computations from halting configurations to halting
configurations.

Hence, the $a(t_1 \cdots t_n)$-related parts of any accepting computation of $M$ on
$ta(t_1 \cdots t_n)$ first go from a starting configuration to a halting configuration,
having $M$ in some state $p$ at $a(t_1 \cdots t_n)$’s root, and then from halting configuration
to halting configuration, leading $M$ from some state $p_i$ to some state $q_i$
on $a(t_1 \cdots t_n)$’s root until $M$ finally leaves $a(t_1 \cdots t_n)$ and does not return.
This implies that $t_1 \cdots t_n$ is in $X_p \cap X_{p_1 q_1} \cap \cdots \cap X_{p_r q_r}$. The state sequence
$p, p_1, q_1, \ldots, p_r, q_r$ documents the behaviour of $M$ at the root of $a(t_1 \cdots t_n)$ during
an accepting computation of $M$ on the complete tree $ta(t_1 \cdots t_n)$.

For any other sequence of trees $t'_1 \cdots t'_m$ in $X_p \cap X_{p_1 q_1} \cap \cdots \cap X_{p_r q_r}$, we
can construct an accepting computation of $M$ on $ta(t'_1 \cdots t'_m)$ by patching
the $a(t_1 \cdots t_n)$-related parts of the original computation with computations on
$a(t'_1 \cdots t'_m)$ that have the same state-behaviour at the root as $a(t_1 \cdots t_n)$ had.
Since $t'_1 \cdots t'_m$ is in $X_p \cap X_{p_1 q_1} \cap \cdots \cap X_{p_r q_r}$, we can find such patches.

We conclude that the whole set $X_p \cap X_{p_1 q_1} \cap \cdots \cap X_{p_r q_r}$ is a subset of $V_{t,a,T_f}(T)$.

Since there are only finitely many sets $X_p$ and $X_{pq}$, the set $V_{t,a,T_f}(T)$ is a finite
union of finite intersections of these.

The next two steps establish that $X_p$ and $X_{pq}$ are regular string languages.

First, a string $t_1 \cdots t_n$ is in $X_p$ if and only if there are $p_1, \ldots, p_n$ in $Q$ such that
$M$, when operating on $t_i$ beginning in a starting configuration, eventually reaches
a halting configuration $c$ such that $c(\mathit{root}(t_i)) = p_i$, and $(a, p_1 \cdots p_n, p, u) \in \delta$. The
regularity of the transition table $\delta$ implies that $X_p$ is a regular string language.

Second, let $X^1_{pq}$ be the subset of $X_{pq}$ in which the computation $c_1 \to^* c_2$
(compare the definition of $X_{pq}$) makes just one computation step and let $X^m_{pq}$
be the subset of $X_{pq}$ in which the computation $c_1 \to^* c_2$ makes more than one
computation step. Then, $X_{pq}$ is the (not necessarily disjoint) union of $X^1_{pq}$ and
$X^m_{pq}$. We demonstrate that both subsets are string regular.

First, note that

$X^1_{pq} = \{t_1 \cdots t_n \in T_f^* \mid (a, p, q, s) \in \delta\}$.

Hence, $X^1_{pq}$ depends only on $a$ and is either empty or $T_f^*$. In both cases, $X^1_{pq}$ is
regular.

Second, $t_1 \cdots t_n \in X^m_{pq}$ if and only if there are $p_1, \ldots, p_n, q_1, \ldots, q_n$ in $Q$ such
that $(a, p_1 \cdots p_n, p, d)$ is in $\delta$ and $M$, when operating on $t_i$, makes a computation
from a halting configuration with root label $p_i$ to a halting configuration with
root label $q_i$, and $(a, q_1 \cdots q_n, q, u) \in \delta$.

The regularity of the transition table $\delta$ implies that $X^m_{pq}$ is a regular string
language. □



Theorem 2. The language of every extended two-way tree automaton is tree 
regular. 

5 Concluding Remarks 

Moriya [?] uses crossing sequences to prove that finite two-way tree automata are
as expressive as finite bottom-up tree automata. Thus, one follow-up question is
whether we can prove our result using Moriya’s approach.

We may define context-free two-way tree automata and ask whether they are
as expressive as context-free bottom-up tree automata. Moriya [?] considers a
pushdown variation on tree automata for which he demonstrates that the two-way
version is indeed more expressive than the bottom-up version. Salomaa [?]
proves that the yield languages of two-way pushdown tree automata are the
recursively-enumerable languages.

Takahashi [?], on the other hand, establishes a different characterization of
regular tree languages. We would be interested in knowing whether her
characterization can be used to derive our algebraic characterization and, conversely,
whether we can use our characterization to prove hers.

As we mention in the introduction, we originally planned to use extended
two-way tree automata to emulate (or execute) caterpillar expressions. Our
difficulty was that we could not design such an emulation. We therefore ask: is
there an effective emulation of caterpillar expressions with extended two-way
tree automata?

Abstract. We present an extension to multiplicities of a classical algorithm
for computing a boolean automaton from a regular expression. The Glushkov
construction computes an automaton with n + 1 states from a regular expression
with n occurrences of letters. We show that the Glushkov algorithm still
applies in the multiplicity case. Next, we give three equivalent extended step
by step algorithms.



1 Introduction 

Let us first recall the progression towards the results presented below and the
underlying ideas. Several software packages have been developed for boolean
automata, among them AMoRE, Automate, Grail and ABOOL. For multiplicities, a
Maple package called AMULT has been developed by Flouret and Laugerotte. The
integration of the AMULT and ABOOL packages in the SEA environment brought the
two underlying theories together. Indeed, this environment allows the two
packages to perform rational operations on automata, and also to go from the
multiplicity theory to the boolean one and vice versa, by “suppressing” or
“extending” coefficients. The classical rational operations are common to both
theories (union and sum, concatenation and Cauchy product, Kleene’s closure and
star), although restrictions may be needed in some cases. We can then wonder
whether similar algorithms could be applied to automata for both theories.
Theoretical and algorithmic backgrounds enable us to compute automata from
regular (rational) expressions. Several algorithms can compute an automaton
from a regular expression. In the boolean case, a classical algorithm of
interest is Glushkov's, which computes an automaton with n + 1 states where n
is the number of occurrences of letters appearing in the expression. In the
multiplicity case, the results of Schützenberger, giving the equivalence
between rational and recognizable series, are the basis of known construction
algorithms. This construction computes an automaton having a number of states
of order 2n from a rational expression with n occurrences of letters. Taking
into account the efficiency of the Glushkov algorithm (in terms of size), we
show that this algorithm can be adapted to the case of multiplicities in any
semiring. This new construction will be called the extended Glushkov
construction. After some preliminaries, we extend in Section 3 the Glushkov
algorithm to the multiplicity case and, in Section 4, we give the step by step
construction. Our last results show the equivalence of these two constructions.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 67-79, 2001.

2 Theoretical Background 

2.1 Definitions and Prerequisites 

Let $\Sigma$ be a finite alphabet, $1$ the empty word, and $K$ a semiring. A formal
series is a mapping $S$ from $\Sigma^*$ into $K$, usually denoted by

\[ S = \sum_{w \in \Sigma^*} (S|w)\, w \]

(where $(S|w) := S(w) \in K$ is the coefficient of $w$ in $S$). In the boolean case
(resp. multiplicity case) the simple rational operations are union (resp. sum),
concatenation (resp. Cauchy product), and Kleene’s closure (resp. star). An
external product has to be defined in the case of multiplicities. Concatenation
is extended to series by the convolution formula

\[ R.S = \sum_{w \in \Sigma^*} \Big( \sum_{uv = w} (R|u)(S|v) \Big)\, w . \]

Note that, if $(S|1) = 0$, the star $S^* = \sum_{n \ge 0} S^n$ is well defined. We extend
the rational closure to the positive closure, $S^+$, by taking $n > 0$.
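
The convolution formula and the star can be tried out on small examples. The following is a minimal sketch, not part of the original construction: it assumes $K = \mathbb{Z}$ (Python integers), represents a finitely supported series as a dictionary from words to coefficients, and truncates $S^*$ to words of bounded length. The names `cauchy` and `star` are illustrative only.

```python
from collections import defaultdict

def cauchy(R, S):
    """Cauchy product of two finitely supported series (dict: word -> coefficient)."""
    out = defaultdict(int)
    for u, ru in R.items():
        for v, sv in S.items():
            out[u + v] += ru * sv          # (R|u)(S|v) contributes to w = uv
    return dict(out)

def star(S, max_len):
    """Coefficients of S* on all words of length <= max_len, for a proper S."""
    if S.get("", 0) != 0:
        raise ValueError("S* is only defined here when (S|1) = 0")
    result = {"": 1}                       # S^0
    power = {"": 1}
    for _ in range(max_len):               # S^n has no word shorter than n
        power = {w: c for w, c in cauchy(power, S).items() if len(w) <= max_len}
        for w, c in power.items():
            result[w] = result.get(w, 0) + c
    return result

R = {"a": 1}
S = {"": 2, "b": 3}
product_RS = cauchy(R, S)                  # a.(2 + 3b) = 2a + 3ab
star_2a = star({"a": 2}, 3)                # 1 + 2a + 4aa + 8aaa, truncated
```

The truncation is safe because a proper series has no empty word in its support, so the n-th power only contains words of length at least n.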

A regular language (resp. rational series) is obtained from the letters by a 
finite number of combinations of the rational laws. The formula thus obtained 
is called a regular (resp. rational) expression of S. 

A boolean automaton $\mathcal{M}$ over an alphabet $\Sigma$ is usually defined as a
5-tuple $(\Sigma, Q, I, F, \delta)$ where $Q$ is a finite set of states, $I \subseteq Q$ the set of
initial states, $F \subseteq Q$ the set of final states, and $\delta \subseteq Q \times \Sigma \times Q$ the set of
edges. We denote by $L(E)$ the language represented by the regular expression $E$
and by $L(\mathcal{M})$ the language recognized by the automaton $\mathcal{M}$. Kleene's theorem
asserts that there always exists an automaton $\mathcal{M}_E$ such that $L(E) = L(\mathcal{M}_E)$.
This result can be extended to the case of multiplicities in any semiring.

A $K$-automaton over an alphabet $\Sigma$ is then a 5-tuple $(\Sigma, Q, I, F, \delta)$ on a
semiring $K$, where $I$, $F$ and $\delta$ are viewed as mappings $I : Q \to K$,
$F : Q \to K$, and $\delta : Q \times \Sigma \times Q \to K$. In other words, a $K$-automaton is an
automaton with input weights, output weights, and a weight associated to each
edge. For each word $w = a_1 \cdots a_p \in \Sigma^*$, the coefficient $(S|w)$ is the sum of the
weights of the successful paths labeled by $a_1 \cdots a_p$, the weight of a path being
the product of its input weight, its edge weights and its output weight. This
is equivalent (with $n = |Q|$) to the data of a triplet $(\lambda, \mu, \gamma)$ where
$\lambda \in K^{1 \times n}$ codes the input states, $\gamma \in K^{n \times 1}$ codes the output states, and
$\mu : \Sigma \to K^{n \times n}$ codes the transition matrices for each letter $a \in \Sigma$ and is
extended to a morphism from $\Sigma^*$ to $K^{n \times n}$; $n$ is called the dimension of the
representation. Then, a series $S$ is recognizable if and only if there exists an
automaton $\mathcal{M} = (\lambda, \mu, \gamma)$ whose behavior

\[ \sum_{w \in \Sigma^*} \lambda\, \mu(w)\, \gamma \; w \]

is $S$. Schützenberger’s classical theorem asserts that rational series are
exactly the recognizable ones. This extends Kleene’s classical result. Note
that we have to be very careful with the validity of rational expressions.
Indeed, the star of a rational expression in which the coefficient of the empty
word is not 0 cannot be a valid rational expression, and in this case we cannot
compute an automaton.
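
To make the linear representation concrete, here is a small sketch that evaluates $\lambda \mu(w) \gamma$ for a given word. It assumes $K = \mathbb{Z}$ and stores matrices as lists of rows; the helper names `mat_mul` and `coefficient` and the two-state example are illustrative assumptions, not taken from the paper.

```python
def mat_mul(A, B):
    """Product of two matrices stored as lists of rows (entries in Z here)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def coefficient(lam, mu, gamma, word):
    """Return lambda . mu(word) . gamma, where mu is extended as a morphism."""
    row = [lam]                          # 1 x n row vector
    for letter in word:
        row = mat_mul(row, mu[letter])   # mu(uv) = mu(u) mu(v)
    col = [[g] for g in gamma]           # n x 1 column vector
    return mat_mul(row, col)[0][0]

# Hypothetical two-state representation of the series (2a)* b:
# state 0 loops on a with weight 2, and b leads from state 0 to the final state 1.
lam = [1, 0]
mu = {"a": [[2, 0], [0, 0]], "b": [[0, 1], [0, 0]]}
gamma = [0, 1]
```

With this representation the coefficient of the word a^n b is 2^n, and every other word has coefficient 0.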

In this paper, we will use both notions of rational expression and series, as 
finite or infinite coding, according to the constructions. 

There are several constructions of boolean automata from regular expressions.
In this paper we are interested in the Glushkov construction, which yields an
automaton with $n + 1$ states where $n$ is the number of occurrences of letters in
the expression. The first step is to index each occurrence of a letter by its
position in the expression $E$. The set of indices is named $Pos(E)$. Let $\bar{E}$ be
the indexed expression. Glushkov defines four functions on $\bar{E}$ that allow us to
compute a not necessarily deterministic automaton. $First(\bar{E})$ represents the
set of initial positions of words of $L(\bar{E})$, $Last(\bar{E})$ the set of final positions
of words of $L(\bar{E})$, and $Follow(\bar{E}, i)$ the set of positions which immediately
follow the position $i$ in the expression $\bar{E}$. $Null(\bar{E})$ returns $\{\varepsilon\}$ if the
language $L(\bar{E})$ contains the empty word, and $\emptyset$ otherwise. In the sequel of the
paper we write $First(E)$ for $First(\bar{E})$, and likewise for the three other
functions. These functions allow us to define the automaton
$\bar{\mathcal{M}} = (\bar{\Sigma}, Q, s_I, F, \bar{\delta})$ where

1. $\bar{\Sigma}$ is the indexed alphabet,

2. $Q = Pos(E) \cup \{s_I\}$,

3. $\forall i \in First(E)$, $\bar{\delta}(s_I, a_i) = \{i\}$, $a_i \in \bar{\Sigma}$,

4. $\forall i \in Pos(E)$, $\forall j \in Follow(E, i)$, $\bar{\delta}(i, a_j) = \{j\}$, $a_j \in \bar{\Sigma}$,

5. $F = Last(E) \cup Null(E) \cdot s_I$.

From $\bar{\mathcal{M}}$ we compute the automaton $\mathcal{M} = (\Sigma, Q, s_I, F, \delta)$ by replacing the
indexed letters on the edges by the corresponding letters of the expression $E$.
Details on this construction can be found in the literature.
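
The boolean construction recalled above can be sketched as follows. This is an illustrative sketch under stated assumptions: expressions are nested tuples such as `("sym", letter, index)`, `("eps",)`, `("alt", F, G)`, `("cat", F, G)` and `("star", F)` (a hypothetical encoding chosen here), and the initial state $s_I$ is represented by `0`.

```python
def null(e):
    if e[0] == "sym": return False
    if e[0] == "alt": return null(e[1]) or null(e[2])
    if e[0] == "cat": return null(e[1]) and null(e[2])
    return True                                   # "eps" and "star"

def first(e):
    if e[0] == "sym": return {e[2]}
    if e[0] == "eps": return set()
    if e[0] == "alt": return first(e[1]) | first(e[2])
    if e[0] == "cat": return first(e[1]) | (first(e[2]) if null(e[1]) else set())
    return first(e[1])                            # star

def last(e):
    if e[0] == "sym": return {e[2]}
    if e[0] == "eps": return set()
    if e[0] == "alt": return last(e[1]) | last(e[2])
    if e[0] == "cat": return last(e[2]) | (last(e[1]) if null(e[2]) else set())
    return last(e[1])                             # star

def follow(e, table):
    """Fill table[i] with the positions that may immediately follow position i."""
    if e[0] == "sym":
        table.setdefault(e[2], set())
    elif e[0] in ("alt", "cat"):
        follow(e[1], table)
        follow(e[2], table)
        if e[0] == "cat":
            for i in last(e[1]):
                table[i] |= first(e[2])
    elif e[0] == "star":
        follow(e[1], table)
        for i in last(e[1]):
            table[i] |= first(e[1])

def glushkov(e):
    """Return (delta, finals); delta maps (state, letter) to a set of states."""
    letters = {}
    def collect(x):
        if x[0] == "sym":
            letters[x[2]] = x[1]
        else:
            for child in x[1:]:
                collect(child)
    collect(e)
    table = {}
    follow(e, table)
    delta = {}
    for j in first(e):
        delta.setdefault((0, letters[j]), set()).add(j)
    for i, targets in table.items():
        for j in targets:
            delta.setdefault((i, letters[j]), set()).add(j)
    finals = last(e) | ({0} if null(e) else set())
    return delta, finals

# (a + b)* a, indexed as (a_1 + b_2)* a_3: three positions, hence four states.
E = ("cat", ("star", ("alt", ("sym", "a", 1), ("sym", "b", 2))), ("sym", "a", 3))
delta, finals = glushkov(E)
```

Replacing the indexed letters by plain letters happens implicitly here: edges are keyed by `letters[j]`, the letter carried by the target position.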



2.2 Classical K Constructions 

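In the following, $K$ is a commutative semiring.

Classical $K$ matrix constructions are known from the literature. We just
recall them for the usual rational operations.

Proposition 1. Let $R$ (resp. $S$) be a rational series, $k \in K$, and consider
$(\lambda^r, \mu^r, \gamma^r)$ (resp. $(\lambda^s, \mu^s, \gamma^s)$) of dimension $p$ (resp. $q$), an associated
matrix representation. The linear representations of the external product, sum,
concatenation and star are respectively

$k \cdot R$:
\[ \left( k\lambda^r,\ \big( \mu^r(a) \big)_{a \in \Sigma},\ \gamma^r \right) \tag{1} \]

$R + S$:
\[ \left( (\lambda^r\ \ \lambda^s),\
   \begin{pmatrix} \mu^r(a) & 0_{p \times q} \\ 0_{q \times p} & \mu^s(a) \end{pmatrix}_{a \in \Sigma},\
   \begin{pmatrix} \gamma^r \\ \gamma^s \end{pmatrix} \right) \tag{2} \]

$R.S$:
\[ \left( (\lambda^r\ \ 0_{1 \times q}),\
   \begin{pmatrix} \mu^r(a) & \gamma^r \lambda^s \mu^s(a) \\ 0_{q \times p} & \mu^s(a) \end{pmatrix}_{a \in \Sigma},\
   \begin{pmatrix} \gamma^r \lambda^s \gamma^s \\ \gamma^s \end{pmatrix} \right) \tag{3} \]

If $\lambda^s \gamma^s = 0$, $S^*$:
\[ \left( (0_{1 \times q}\ \ 1),\
   \begin{pmatrix} \mu^s(a) + \gamma^s \lambda^s \mu^s(a) & 0_{q \times 1} \\ \lambda^s \mu^s(a) & 0 \end{pmatrix}_{a \in \Sigma},\
   \begin{pmatrix} \gamma^s \\ 1 \end{pmatrix} \right) \tag{4} \]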

We can notice that these constructions are always valid whatever the struc- 
ture of the automata on which they are applied may be. 



Proposition 2. Let $E$ be a rational expression over $K$ such that $|E| = n$ is the
number of its letters, let $\mathcal{M}_E$ be the automaton obtained from $E$ with the
classical constructions, and let $|\mathcal{M}_E|$ be the number of its states. Then
$2n \le |\mathcal{M}_E| \le 3n$.



The proof of this proposition is in the full version of the paper. 



3 The Extended Glushkov Automaton 

Let $E$ be an expression over the alphabet $\Sigma$ of a rational series with
coefficients in $K$. We index letters by their position in the expression. Let
$\bar{E}$ be the new expression and $Pos(E) = \{ i \in \mathbb{N} \mid a_i \in \bar{E},\ a \in \Sigma \}$. To compute
the automaton corresponding to a rational expression with the Glushkov
algorithm, as in the classical (boolean) case, four recursive functions on $\bar{E}$
have to be defined.

3.1 Extended Definitions 

The Null function allows us to know the coefficient of the empty word. 

Definition 1. The Null function is defined recursively on the expression as
follows. Let $k \in K$.

\begin{align*}
Null(\emptyset) &= 0 \\
Null(k) &= k \\
Null(a) &= 0 \\
Null(k \cdot F) &= k \cdot Null(F) \\
Null(F + G) &= Null(F) + Null(G) \\
Null(F \cdot G) &= Null(F) \cdot Null(G) \\
Null(F^+) &= 0 \\
Null(F^*) &= 1
\end{align*}

We have to define an external product of a constant and a set of couples.
Let $k \in K$ and $X = \{(l_s, i_s) \mid l_s \in K,\ i_s \in \mathbb{N}\}_{1 \le s \le p}$. The product is then
written $k \cdot X = \{(k \times l_s, i_s)\}_{1 \le s \le p}$ if $k \neq 0$, $0 \cdot X = \emptyset$, and
$X \cdot k = \{(l_s \times k, i_s)\}_{1 \le s \le p}$ if $k \neq 0$, $X \cdot 0 = \emptyset$.

Then the states connected to the initial state, with their associated
coefficients, are given by the First function.

Definition 2. First is recursively defined as follows:

\begin{align*}
First(\emptyset) &= \emptyset \\
First(k) &= \emptyset \\
First(a_i) &= \{(1, i)\} \\
First(k \cdot F) &= k \cdot First(F) \\
First(F + G) &= First(F) \cup First(G) \\
First(F \cdot G) &= First(F) \cup Null(F) \cdot First(G) \\
First(F^+) &= First(F) \\
First(F^*) &= First(F)
\end{align*}

The terminal states with their associated output coefficients are given by the 
Last function. 

Definition 3. Last is recursively defined as follows:

\begin{align*}
Last(\emptyset) &= \emptyset \\
Last(k) &= \emptyset \\
Last(a_i) &= \{(1, i)\} \\
Last(k \cdot F) &= Last(F) \\
Last(F + G) &= Last(F) \cup Last(G) \\
Last(F \cdot G) &= Last(F) \cdot Null(G) \cup Last(G) \\
Last(F^+) &= Last(F) \\
Last(F^*) &= Last(F)
\end{align*}

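In the remainder of the paper, we need the Coeff function, which is defined on
a set of couples. Let $X$ be a set of couples $(k, p)$, where $k \in K$ and $p \in Q$. Then

\[ Coeff_X(i) = \begin{cases} k & \text{if } (k, i) \in X, \\ 0 & \text{otherwise,} \end{cases} \]

so that in particular $Coeff_\emptyset(i) = 0$. We will also use the indicator
function $\mathbb{I}_X$ of a set $X$, which is defined by

\[ \mathbb{I}_X(i) = \begin{cases} 1 & \text{if } i \in X, \\ 0 & \text{otherwise.} \end{cases} \]

Now, the Follow function allows us to compute the edges connecting the states
to each other, as well as the corresponding multiplicities.

Definition 4. Follow is recursively defined by ($i \in \mathbb{N}^*$):

\begin{align*}
Follow(\emptyset, i) &= \emptyset \\
Follow(k, i) &= \emptyset \\
Follow(a_j, i) &= \emptyset \\
Follow(k \cdot F, i) &= Follow(F, i) \\
Follow(F + G, i) &= \mathbb{I}_{Pos(F)}(i) \cdot Follow(F, i) \cup \mathbb{I}_{Pos(G)}(i) \cdot Follow(G, i) \\
Follow(F \cdot G, i) &= \mathbb{I}_{Pos(F)}(i) \cdot Follow(F, i) \\
                     &\quad \cup\ \mathbb{I}_{Pos(G)}(i) \cdot Follow(G, i) \\
                     &\quad \cup\ Coeff_{Last(F)}(i) \cdot First(G) \\
Follow(F^+, i) &= Follow(F^*, i) \\
               &= Follow(F, i) \cup Coeff_{Last(F)}(i) \cdot First(F)
\end{align*}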



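All these functions allow us to define an extended Glushkov automaton.

Definition 5. Let $\mathcal{M} = (\Sigma, Q, I, F, \delta)$ be the extended Glushkov automaton of
the expression $E$. It is defined on a semiring $K$ and on an alphabet $\Sigma$ as follows:

- $Q = Pos(E) \cup \{0\}$

- $I : Q \to K$ such that $I(0) = 1$ and $I(i) = 0$ for $i \neq 0$

- $F : Q \to K$ such that $F(0) = Null(E)$ and $F(i) = Coeff_{Last(E)}(i)$ for $i \neq 0$

- $\delta : Q \times \Sigma \times Q \to K$ such that

  $\delta(i, a, j) = Coeff_{Follow(E,i)}(j)$ if $a_j \in \bar{a}$ ($i \neq 0$),
  $\delta(0, a, j) = Coeff_{First(E)}(j)$ if $a_j \in \bar{a}$,
  $\delta(i, a, 0) = 0$,

  where $\bar{a}$ denotes the set of indexed occurrences of the letter $a$ in $\bar{E}$.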
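
The four functions and the automaton they define can be prototyped directly. The sketch below is hedged: it assumes $K = \mathbb{Z}$, encodes expressions as nested tuples (`("sym", letter, index)`, `("const", k)`, `("scal", k, F)` for $k \cdot F$, `("sum", F, G)`, `("cat", F, G)`, `("star", F)`; the encoding is an assumption made for this illustration), stores a set of couples as a dictionary from positions to coefficients, and adds coefficients when a union meets the same position twice, which the definitions above do not state explicitly. The positive closure is omitted for brevity. The test expression is $5[2ab + 3b \cdot 4(ab)^*]^*$.

```python
def scale_left(k, X):                 # k.X, with 0.X = empty set
    return {i: k * l for i, l in X.items()} if k != 0 else {}

def scale_right(X, k):                # X.k, with X.0 = empty set
    return {i: l * k for i, l in X.items()} if k != 0 else {}

def union(X, Y):                      # assumption: shared positions add up
    out = dict(X)
    for i, l in Y.items():
        out[i] = out.get(i, 0) + l
    return out

def pos(e):
    if e[0] == "sym": return {e[2]}
    if e[0] == "const": return set()
    return set().union(*(pos(c) for c in e[1:] if isinstance(c, tuple)))

def null(e):
    if e[0] == "sym": return 0
    if e[0] == "const": return e[1]
    if e[0] == "scal": return e[1] * null(e[2])
    if e[0] == "sum": return null(e[1]) + null(e[2])
    if e[0] == "cat": return null(e[1]) * null(e[2])
    return 1                          # star

def first(e):
    if e[0] == "sym": return {e[2]: 1}
    if e[0] == "const": return {}
    if e[0] == "scal": return scale_left(e[1], first(e[2]))
    if e[0] == "sum": return union(first(e[1]), first(e[2]))
    if e[0] == "cat": return union(first(e[1]), scale_left(null(e[1]), first(e[2])))
    return first(e[1])                # star

def last(e):
    if e[0] == "sym": return {e[2]: 1}
    if e[0] == "const": return {}
    if e[0] == "scal": return last(e[2])
    if e[0] == "sum": return union(last(e[1]), last(e[2]))
    if e[0] == "cat": return union(scale_right(last(e[1]), null(e[2])), last(e[2]))
    return last(e[1])                 # star

def follow(e, i):
    if e[0] in ("sym", "const"): return {}
    if e[0] == "scal": return follow(e[2], i)
    if e[0] == "star":
        return union(follow(e[1], i), scale_left(last(e[1]).get(i, 0), first(e[1])))
    f, g = e[1], e[2]
    out = union(follow(f, i) if i in pos(f) else {}, follow(g, i) if i in pos(g) else {})
    if e[0] == "cat":
        out = union(out, scale_left(last(f).get(i, 0), first(g)))
    return out

def glushkov(e, letters):
    """letters maps each position to its letter; state 0 is the initial state."""
    final = {0: null(e)}
    final.update({i: last(e).get(i, 0) for i in letters})
    delta = {}                        # (i, a, j) -> weight
    for j, k in first(e).items():
        delta[(0, letters[j], j)] = k
    for i in letters:
        for j, k in follow(e, i).items():
            delta[(i, letters[j], j)] = k
    return final, delta

def weight(final, delta, word):
    """Sum of the weights of the successful paths labeled by word."""
    current = {0: 1}
    for a in word:
        nxt = {}
        for (i, b, j), k in delta.items():
            if b == a and i in current:
                nxt[j] = nxt.get(j, 0) + current[i] * k
        current = nxt
    return sum(w * final.get(q, 0) for q, w in current.items())

a1, b2, b3, a4, b5 = (("sym", "a", 1), ("sym", "b", 2), ("sym", "b", 3),
                      ("sym", "a", 4), ("sym", "b", 5))
F = ("sum", ("scal", 2, ("cat", a1, b2)),
     ("cat", ("scal", 3, b3), ("scal", 4, ("star", ("cat", a4, b5)))))
E = ("scal", 5, ("star", F))
final, delta = glushkov(E, {1: "a", 2: "b", 3: "b", 4: "a", 5: "b"})
```

A quick sanity check is to compare `weight` with the series itself: the word b has coefficient 3·4 = 12 in the bracketed expression, so bb has coefficient 5·12·12 = 720 in the whole expression.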



3.2 Matrix Extension of the Glushkov Construction 



Definition 5 gives the computation of the extended Glushkov automaton. The
linear representation $(\lambda, \mu, \gamma)$ is deduced from this definition. Its dimension
is $|Q|$. For $0 \le i, j \le |Q| - 1$, $\lambda_i = I(i)$, $\gamma_i = F(i)$, and $\mu_{ij}(a) = \delta(i, a, j)$.

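Example:

Let $K = \mathbb{Z}$. Let $E = 5[2ab + 3b \cdot 4(ab)^*]^*$; then we have
$\bar{E} = 5[2a_1b_2 + 3b_3 \cdot 4(a_4b_5)^*]^*$. Let us compute the functions defined above,
writing $F = 2a_1b_2 + 3b_3 \cdot 4(a_4b_5)^*$, so that $E = 5 \cdot F^*$.

\begin{align*}
Null(E) &= Null(5 \cdot F^*) \\
        &= 5 \cdot Null(F^*) \\
        &= 5 \\
Last(E) &= Last(2a_1b_2 + 3b_3 \cdot 4(a_4b_5)^*) \\
        &= Last(2a_1b_2) \cup Last(3b_3 \cdot 4(a_4b_5)^*) \\
        &= \{(1,2)\} \cup Last(b_3) \cdot Null(4(a_4b_5)^*) \cup Last(4(a_4b_5)^*) \\
        &= \{(1,2), (4,3), (1,5)\} \\
First(E) &= 5 \cdot First(2a_1b_2 + 3b_3 \cdot 4(a_4b_5)^*) \\
         &= 5 \cdot First(2a_1b_2) \cup 5 \cdot First(3b_3 \cdot 4(a_4b_5)^*) \\
         &= 10 \cdot First(a_1b_2) \cup 15 \cdot First(b_3 \cdot 4(a_4b_5)^*) \\
         &= \{(10,1), (15,3)\} \\
Follow(E,1) &= Follow(2a_1b_2, 1) \\
            &= Follow(a_1, 1) \cup \{(1,2)\} \\
            &= \{(1,2)\} \\
Follow(E,2) &= Follow(F, 2) \cup Coeff_{Last(F)}(2) \cdot First(F) \\
            &= Follow(F, 2) \cup \{(2,1)\} \cup \{(3,3)\} \\
            &= Follow(2a_1b_2, 2) \cup \{(2,1)\} \cup \{(3,3)\} \\
            &= \{(2,1), (3,3)\} \\
Follow(E,3) &= Follow(F, 3) \cup Coeff_{Last(F)}(3) \cdot First(F) \\
            &= Follow(3b_3 \cdot 4(a_4b_5)^*, 3) \cup 4 \cdot \{(2,1), (3,3)\} \\
            &= First(4(a_4b_5)^*) \cup \{(8,1), (12,3)\} \\
            &= \{(4,4), (8,1), (12,3)\} \\
Follow(E,4) &= Follow(F, 4) \\
            &= Follow(3b_3 \cdot 4(a_4b_5)^*, 4) \\
            &= Follow((a_4b_5)^*, 4) \\
            &= Follow(a_4b_5, 4) \\
            &= \{(1,5)\} \\
Follow(E,5) &= Follow(F, 5) \cup Coeff_{Last(F)}(5) \cdot First(F) \\
            &= Follow((a_4b_5)^*, 5) \cup \{(2,1), (3,3)\} \\
            &= \{(1,4), (2,1), (3,3)\}
\end{align*}

Note that position 3 carries the output coefficient 4 in $Last(F)$, so the
couples of $First(F)$ are multiplied by 4 in $Follow(E,3)$.

Fig. 1. Extended Glushkov automaton of the expression E = 5[2ab + 3b·4(ab)*]*

In terms of matrix representation, it can be written as follows:

\[ \lambda = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 & 0 \end{pmatrix}, \quad
   \mu = \begin{pmatrix}
     0 & 10 & 0 & 15 & 0 & 0 \\
     0 & 0  & 1 & 0  & 0 & 0 \\
     0 & 2  & 0 & 3  & 0 & 0 \\
     0 & 8  & 0 & 12 & 4 & 0 \\
     0 & 0  & 0 & 0  & 0 & 1 \\
     0 & 2  & 0 & 3  & 1 & 0
   \end{pmatrix}, \quad
   \gamma = \begin{pmatrix} 5 \\ 0 \\ 1 \\ 4 \\ 0 \\ 1 \end{pmatrix} \]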

We can notice that for each letter $a$, the only non-null columns of $\mu(a)$ are
the ones indexed by $j$ such that $a_j \in \bar{a}$. The matrix of the transition edges of
the automaton can be seen as the superposition of the matrices for each letter,
which justifies the writing of $\mu$ above.



Proposition 3. Every edge reaching a given state is labelled by the same letter.

Proof. Extending the definition of $\delta$ to $\bar{\Sigma}$, the construction of the extended
Glushkov automaton implies that if $l \neq j$, $\bar{\delta}(i, a_l, j) = 0$. □

The construction of the extended Glushkov automaton leads to the following 
properties. 

Proposition 4. Let $(\lambda, \mu, \gamma)$ be the linear representation of an extended
Glushkov automaton with $n + 1$ states. This representation has the following
properties:

1. $\lambda = (1\ \ 0_{1 \times n})$;

2. $\forall a \in \Sigma$, $\forall i \in [0..n]$, with $\bar{a} = \{a_l \mid l \in [1..n]\}$, $\mu_{ij}(a) = 0$ if $a_j \notin \bar{a}$.

Theorem 1. The extended Glushkov automaton of an expression E recognizes 
the series denoted by E. 

Sketch of proof. This theorem is proved inductively. If $E = k$, the
representation obtained from Definition 5 can be written $((1), (0), (k))$ and
then $\lambda \cdot \mu(w) \cdot \gamma = k$ if $w = \varepsilon$, 0 otherwise. We verify in the same way for
$E = a$ that $\lambda \cdot \mu(w) \cdot \gamma = 1$ if $w = a$, 0 otherwise. Let $F$ and $G$ be two
rational expressions representing the series $S$ and $T$. We suppose that the
extended Glushkov automata $\mathcal{M}_F$ and $\mathcal{M}_G$ recognize respectively the series $S$
and $T$. We prove that for $E = F + G$ the extended Glushkov automaton $\mathcal{M}_{F+G}$
recognizes the series $S + T$. For $E = F.G$, we show that

\[ \lambda_{S.T} \cdot \mu_{S.T}(w) \cdot \gamma_{S.T} = \sum_{w = u \cdot v} (\lambda_S \cdot \mu_S(u) \cdot \gamma_S) \times (\lambda_T \cdot \mu_T(v) \cdot \gamma_T), \]

and so on for $E = k \cdot F$ and $E = F^*$. □

4 Step by Step Construction 

The step by step algorithm consists in computing a new automaton starting
from automata with the following properties:

- single initial state with no incoming edge,

- reduced automata.

The first property relies on Proposition 4. We can then replace the mapping $I$
by the state $s_I$, which is the only one to have a non-null input weight (which
is 1).

We give here the three step by step algorithms corresponding to the rational
operations Cauchy product, sum and star. Let $\mathcal{M}_1 = (\Sigma_1, Q_1, s_1, F_1, \delta_1)$ and
$\mathcal{M}_2 = (\Sigma_2, Q_2, s_2, F_2, \delta_2)$ be the automata of two series with the preceding
properties.

4.1 Sum Algorithm 

The sum automaton M3 = (Σ3, Q3, s3, F3, δ3) = M1 + M2 is defined by the
following algorithm¹.

¹ For all the algorithms, we write for simplicity (p, a, q) ∈ δi for (p, a, q) ∈ Qi × Σi × Qi

begin
    Σ3 ← Σ1 ∪ Σ2
    Q3 ← Q1 ∪ Q2 \ {s2}
    s3 ← s1
    F3(s3) ← F1(s1) + F2(s2)
    F3(q), q ∈ Q1 \ {s1} ← F1(q)
    F3(q), q ∈ Q2 \ {s2} ← F2(q)
    foreach (p, a, q) ∈ δ1 do
        δ3(p, a, q) ← δ1(p, a, q)
    end foreach
    foreach (p, a, q) ∈ δ2 such that p ≠ s2 do
        δ3(p, a, q) ← δ2(p, a, q)
    end foreach
    foreach (s2, a, q) ∈ δ2 do
        δ3(s3, a, q) ← δ2(s2, a, q)
    end foreach
end
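
As an illustration, the sum algorithm can be sketched in a few lines. The representation is an assumption made for this example: an automaton is a dictionary with an `init` state, a `final` map from states to output weights and a `delta` map from triples (p, a, q) to weights, over $K = \mathbb{Z}$, and the two state sets are taken to be disjoint. The helper `weight` is only there to check the result.

```python
def sum_automaton(m1, m2):
    """Merge the initial state s2 of m2 into the initial state s1 of m1."""
    s1, s2 = m1["init"], m2["init"]
    final = dict(m1["final"])                          # F3 = F1 on Q1 \ {s1}
    final.update({q: w for q, w in m2["final"].items() if q != s2})
    final[s1] = m1["final"].get(s1, 0) + m2["final"].get(s2, 0)
    delta = dict(m1["delta"])                          # edges of m1 are kept
    for (p, a, q), w in m2["delta"].items():
        # edges leaving s2 now leave s3 = s1; s2 has no incoming edge
        delta[(s1 if p == s2 else p, a, q)] = w
    return {"init": s1, "final": final, "delta": delta}

def weight(m, word):
    """Sum of the weights of the successful paths labeled by word."""
    current = {m["init"]: 1}
    for a in word:
        nxt = {}
        for (p, b, q), w in m["delta"].items():
            if b == a and p in current:
                nxt[q] = nxt.get(q, 0) + current[p] * w
        current = nxt
    return sum(c * m["final"].get(q, 0) for q, c in current.items())

# Hypothetical inputs: m1 recognizes 2a, m2 recognizes 7 + 3b.
m1 = {"init": 0, "final": {0: 0, 1: 1}, "delta": {(0, "a", 1): 2}}
m2 = {"init": 10, "final": {10: 7, 11: 1}, "delta": {(10, "b", 11): 3}}
m3 = sum_automaton(m1, m2)
```

The result has |Q1| + |Q2| - 1 states, since the two initial states are identified.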



4.2 Cauchy Product Algorithm 

M3 = (Σ3, Q3, s3, F3, δ3) = M1 · M2 is computed as follows.

begin
    Σ3 ← Σ1 ∪ Σ2
    Q3 ← Q1 ∪ Q2 \ {s2}
    s3 ← s1
    F3(q), q ∈ Q2 \ {s2} ← F2(q)
    if F2(s2) ≠ 0 then
        F3(q), q ∈ Q1 ← F1(q) × F2(s2)
    else
        F3(q), q ∈ Q1 ← 0
    end if
    foreach (p, a, q) ∈ δ1 do
        δ3(p, a, q) ← δ1(p, a, q)
    end foreach
    foreach (p, a, q) ∈ δ2 such that p ≠ s2 do
        δ3(p, a, q) ← δ2(p, a, q)
    end foreach
    foreach p ∈ Q1 such that F1(p) ≠ 0 do
        foreach q ∈ Q2 such that there exist a ∈ Σ2 and (s2, a, q) ∈ δ2 do
            δ3(p, a, q) ← F1(p) × δ2(s2, a, q)
        end foreach
    end foreach
end

4.3 Star Algorithm

M3 = (Σ3, Q3, s3, F3, δ3) = M1* is computed as follows.

begin
    Σ3 ← Σ1
    Q3 ← Q1
    s3 ← s1
    F3(s3) ← 1
    F3(p), p ∈ Q1 \ {s1} ← F1(p)
    foreach (p, a, q) ∈ δ1 do
        δ3(p, a, q) ← δ1(p, a, q)
    end foreach
    foreach p ∈ Q1 such that F1(p) ≠ 0 do
        foreach q ∈ Q1 such that there exist a ∈ Σ1 and δ1(s1, a, q) ≠ 0 do
            δ3(p, a, q) ← F1(p) × δ1(s1, a, q)
        end foreach
    end foreach
end
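
The Cauchy product and star algorithms can be sketched in the same spirit. Again this is an illustration under assumptions: automata are dictionaries with `init`, `final` and `delta` entries over $K = \mathbb{Z}$, and state sets are disjoint. One deliberate deviation is flagged in the code: in the star, the new weight is added to an edge that may already exist between p and q, which is an assumption made here to cover that case rather than something the pseudocode states.

```python
def cauchy_automaton(m1, m2):
    s1, s2 = m1["init"], m2["init"]
    f1, f2 = m1["final"], m2["final"]
    null2 = f2.get(s2, 0)
    final = {q: w * null2 for q, w in f1.items()}            # F3 on Q1
    final.update({q: w for q, w in f2.items() if q != s2})   # F3 on Q2 \ {s2}
    delta = dict(m1["delta"])
    for (p, a, q), w in m2["delta"].items():
        if p != s2:
            delta[(p, a, q)] = w
    for p, fp in f1.items():                                 # final states of m1 ...
        if fp != 0:
            for (p2, a, q), w in m2["delta"].items():
                if p2 == s2:                                 # ... take over the edges of s2
                    delta[(p, a, q)] = fp * w
    return {"init": s1, "final": final, "delta": delta}

def star_automaton(m1):
    s1, f1 = m1["init"], m1["final"]
    if f1.get(s1, 0) != 0:
        raise ValueError("the star needs a null coefficient on the empty word")
    final = dict(f1)
    final[s1] = 1
    delta = dict(m1["delta"])
    for p, fp in f1.items():
        if fp != 0:
            for (p1, a, q), w in m1["delta"].items():
                if p1 == s1 and w != 0:
                    # assumption: accumulate instead of overwriting an existing edge
                    delta[(p, a, q)] = delta.get((p, a, q), 0) + fp * w
    return {"init": s1, "final": final, "delta": delta}

def weight(m, word):
    """Sum of the weights of the successful paths labeled by word."""
    current = {m["init"]: 1}
    for a in word:
        nxt = {}
        for (p, b, q), w in m["delta"].items():
            if b == a and p in current:
                nxt[q] = nxt.get(q, 0) + current[p] * w
        current = nxt
    return sum(c * m["final"].get(q, 0) for q, c in current.items())

# Hypothetical inputs: m1 recognizes 7 + 2a, m2 recognizes 3b.
m1 = {"init": 0, "final": {0: 7, 1: 1}, "delta": {(0, "a", 1): 2}}
m2 = {"init": 10, "final": {10: 0, 11: 1}, "delta": {(10, "b", 11): 3}}
prod = cauchy_automaton(m1, m2)          # (7 + 2a).3b = 21b + 6ab
loop = star_automaton(prod)              # (21b + 6ab)*
```

Neither operation adds a state beyond those of its arguments, which is how the step by step construction keeps the n + 1 bound of the Glushkov automaton.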



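The proposition below gives the matrix construction patterns for the rational
operations. We will always consider the construction over an alphabet $\Sigma$.

We need the following definitions. Let

\[ Id^{c}_n = \begin{pmatrix} 0 \ \cdots \ 0 \\ Id_n \end{pmatrix}, \qquad
   Id^{r}_n = \begin{pmatrix} 0 & \\ \vdots & Id_n \\ 0 & \end{pmatrix} \qquad \text{and} \qquad
   L_n = (\underbrace{1 \ 0 \ \cdots \ 0}_{n}), \]

with $Id_n$ the identity matrix of dimension $n$. For every square matrix $M$ of
size $n+1$, $M \cdot Id^{c}_n$ deletes the first column, $Id^{r}_n \cdot M$ deletes the first row
and $L_{n+1} \cdot M$ selects the first row.

Proposition 5. Let $R$ (resp. $S$) be a rational series over $K$ and let
$\rho_r = (\lambda^r, \mu^r, \gamma^r)$ (resp. $\rho_s = (\lambda^s, \mu^s, \gamma^s)$) be a corresponding linear
representation of dimension $n + 1$ (resp. $m + 1$) with a single initial state
with no incoming edge. The step by step algorithms lead to the following
representations for $k \cdot R$, $R + S$, $R \cdot S$ and $S^*$. In all cases, the resulting
representation is denoted $(\lambda, \mu, \gamma)$. The linear representations of the
external product, the sum, the concatenation and the star are respectively

$k \cdot R$:
\[ \left( (1 \ \ 0_{1 \times n}),\
   \begin{pmatrix} 0 & k L_{n+1} \mu^r(a) Id^{c}_n \\ 0_{n \times 1} & Id^{r}_n \mu^r(a) Id^{c}_n \end{pmatrix}_{a \in \Sigma},\
   \begin{pmatrix} k L_{n+1} \gamma^r \\ Id^{r}_n \gamma^r \end{pmatrix} \right) \]

Denote $\lambda = (1 \ \ 0_{1 \times n} \ \ 0_{1 \times m})$; then we have:

$R + S$:
\[ \left( \lambda,\
   \begin{pmatrix}
     0 & L_{n+1} \mu^r(a) Id^{c}_n & L_{m+1} \mu^s(a) Id^{c}_m \\
     0_{n \times 1} & Id^{r}_n \mu^r(a) Id^{c}_n & 0_{n \times m} \\
     0_{m \times 1} & 0_{m \times n} & Id^{r}_m \mu^s(a) Id^{c}_m
   \end{pmatrix}_{a \in \Sigma},\
   \begin{pmatrix} L_{n+1} \gamma^r + L_{m+1} \gamma^s \\ Id^{r}_n \gamma^r \\ Id^{r}_m \gamma^s \end{pmatrix} \right) \]

$R \cdot S$:
\[ \left( \lambda,\
   \begin{pmatrix}
     0 & L_{n+1} \mu^r(a) Id^{c}_n & L_{n+1} \gamma^r \lambda^s \mu^s(a) Id^{c}_m \\
     0_{n \times 1} & Id^{r}_n \mu^r(a) Id^{c}_n & Id^{r}_n \gamma^r \lambda^s \mu^s(a) Id^{c}_m \\
     0_{m \times 1} & 0_{m \times n} & Id^{r}_m \mu^s(a) Id^{c}_m
   \end{pmatrix}_{a \in \Sigma},\
   \begin{pmatrix} L_{n+1} \gamma^r L_{m+1} \gamma^s \\ Id^{r}_n \gamma^r L_{m+1} \gamma^s \\ Id^{r}_m \gamma^s \end{pmatrix} \right) \]

If $\lambda^s \gamma^s = 0$, $S^*$:
\[ \left( (1 \ \ 0_{1 \times m}),\
   \begin{pmatrix}
     0 & L_{m+1} \mu^s(a) Id^{c}_m \\
     0_{m \times 1} & Id^{r}_m \mu^s(a) Id^{c}_m + Id^{r}_m \gamma^s L_{m+1} \mu^s(a) Id^{c}_m
   \end{pmatrix}_{a \in \Sigma},\
   \begin{pmatrix} 1 \\ Id^{r}_m \gamma^s \end{pmatrix} \right) \]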

Sketch of proof. First, the elements of the linear representation associated to 
each operation are built by direct use of the algorithms. Next, we prove that 
these representations recognize the resulting series. 

The complete proof is given in the full version of this paper. □ 



Corollary 1. The step by step algorithms preserve the properties of single
initial state and of trim automaton.



The extended Glushkov construction can be written in terms of matrix
representations. In fact, this way of computing consists in substituting
operations over automata for operations on series (expressions). The following
proposition gives the completely recursive representation of such automata,
that is, a representation completely free from the calculus of the functions
Null, First, Last and Follow, and proves the equivalence of the two
constructions.

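Proposition 6. Let

\[ (\lambda^k, \mu^k, \gamma^k) = \left( (1),\ \big( (0) \big)_{a \in \Sigma},\ (k) \right) \tag{5} \]

be the representation of $k$ and

\[ (\lambda^a, \mu^a, \gamma^a) = \left( (1 \ \ 0),\
   \left( \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix},\ (0_{2 \times 2})_{b \in \Sigma,\ b \neq a} \right),\
   \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right) \tag{6} \]

the representation of $a$.

With these basic constructions, the step by step algorithms build a Glushkov
automaton from a rational expression.



Sketch of proof. Let $F$ be a rational expression with $n$ occurrences of letters
and $(\lambda^F, \mu^F, \gamma^F)$ its extended Glushkov representation. For $F = k$ and $F = a$,
the representations (5) and (6) correspond by definition to the extended
Glushkov representation, that is

\[ \lambda^F = (1 \ \ 0_{1 \times n}), \qquad
   \mu^F = \begin{pmatrix}
     0 & \big( Coeff_{First(F)}(j) \big)_{1 \le j \le n} \\
     0_{n \times 1} & \big( Coeff_{Follow(F,i)}(j) \big)_{1 \le i,j \le n}
   \end{pmatrix} \tag{7} \]

\[ \gamma^F = \begin{pmatrix} Null(F) \\ \big( Coeff_{Last(F)}(i) \big)_{1 \le i \le n} \end{pmatrix} \tag{8} \]

By induction on the length of rational expressions, we prove that their
representations, built from (5) and (6) using the patterns of Proposition 5,
are the Glushkov ones and then verify (7) and (8). □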



5 Conclusion 

In this paper, we have given a new algorithm for computing an automaton with
multiplicities. We have verified that the boolean Glushkov construction carries
over to the multiplicity case with some adaptations. Compared with the
classical constructions, this new algorithm reduces the size of the automaton
by a factor of two to three.

Abstract. The aim of this paper is to compare three efficient representations
of the position automaton of a regular expression: the Thompson ε-automaton,
the ZPC-structure and the F-structure, an optimization of the ZPC-structure.
These representations are linear w.r.t. the size $s$ of the expression, since
their construction is in $O(s)$ space and time, as well as the computation of
the set $\delta(X, a)$ of the targets of the transitions by $a$ of any subset $X$ of
states. The comparison is based on the evaluation of the number of edges of
the underlying graphs respectively created by the construction step or visited
by the computation of a set $\delta(X, a)$.



1 Introduction 

An efficient implementation of NFA computations is based on two main
features: the data structure that represents the NFA and the process that
computes the set $\delta(X, a)$ of the targets of the transitions by $a$ of an
arbitrary subset $X$ of states. In the general case, the $\delta(X, a)$ sets are a
priori independent and the NFA is stored in a table whose $(q, a)$-entry is the
set $\delta(q, a)$. The complexity of an NFA is therefore generally measured by the
size of its transition table. According to this convention, the position
automaton $\mathcal{P}_E$ of a regular expression of alphabetic width $n$ has the same
$O(n^2)$ complexity as an arbitrary automaton with $n$ states, whereas the common
follow sets automaton has an $O(n \log^2(n))$ complexity.

However, this classical definition does not take into consideration the
specific properties that allow a more efficient implementation of some
families of NFAs. This is the case when the NFA is deduced from a regular
expression: the syntactic structure of the expression induces dependencies
among the $\delta(X, a)$ sets. Thanks to this property, a shared representation of
the information can be designed, which both saves memory space and speeds up
the computation of the $\delta(X, a)$ sets. For example, the position automaton of an
expression of alphabetic width $n$ can be implemented with an $O(n)$ space and
time complexity, leading to an $O(n)$ computation of any $\delta(X, a)$ set. In this
paper we examine three representations of the position automaton yielding this
complexity: the Thompson ε-automaton, the ZPC-structure, and an optimization of
the ZPC-structure, the F-structure. We explain the relationship between these
structures and compare the number of elementary operations involved by their
construction and by the computation of the $\delta(X, a)$ sets. Notice that the
relationship between the position automaton and the Thompson ε-automaton has
already been studied; our approach is a more algorithmic one: in particular,
the Thompson ε-automaton is compared to other linear representations rather
than to the position automaton itself.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 80-93, 2001.

© Springer-Verlag Berlin Heidelberg 2001

In order to deepen the comparison, some assumptions are made concerning the
input regular expressions. These hypotheses are presented in the following
section. In Section 3, we recall the definition of the position automaton and
the complexity of the computations performed on its table. Section 4 is
devoted to the ZPC-structure: we present an inductive definition of this
representation and give an accurate analysis of its complexity. The
construction of the Thompson ε-automaton is recalled in Section 5. Section 6
provides a comparison between the ZPC-structure of an expression and its
Thompson ε-automaton, based on the fact that the state graph of the
ε-automaton is deduced from the syntax tree of the expression. The F-structure
and its complexity are described in Section 7. In order to provide a visual
comparison of the various constructions, the figures have been gathered in
Annex A.

2 Hypothesis 

The complexity of a regular expression is generally measured by its size $s$,
i.e. the length of its prefixed form, or by its (alphabetic) width $w$, i.e. the
number of occurrences of alphabet symbols. An arbitrary regular expression,
such as the argument of a pattern matching command, may contain an arbitrary
number of empty set, empty word and Kleene star operator occurrences. Its
complexity should be measured by $s$, which may be arbitrarily greater than $w$.
This is the reason why the complexity of the structures we study is first
given w.r.t. $s$. In practical applications, it is profitable to preprocess the
input expression in order to reduce the number of empty set, empty word and
Kleene star operator occurrences. We define reduced expressions and show that
their size and width are linearly dependent.

Definition 1. A regular expression $E$ is said to be reduced if it is such that:

- Either $E$ is the expression $\emptyset$ or $E$ contains no occurrence of the empty set.

- Either $E$ is the expression $\varepsilon$ or the empty word only occurs in subexpressions
  $F + \varepsilon$ or $\varepsilon + F$, with $F \neq \varepsilon$ ('+' is denoted by '$+_\varepsilon$' in these expressions).

- Two consecutive operations in $E$ cannot be both either a '$*$' or a '$+_\varepsilon$'
  operation.

Proposition 1. Let $E'$ be a regular expression of size $s'$ and width $w'$. It is
possible to construct a reduced regular expression $E$ equivalent to $E'$, of size
$s$ and width $w$, such that:

• The expression $E$ is deduced from $E'$ in $O(s')$ space and time.

• The expressions $E$ and $E'$ have an identical width: $w = w'$.

• If $E \neq \emptyset$ and $E \neq \varepsilon$, then $E$ is such that:

  - The overall number of '$\cdot$' and '+' operators is equal to $w - 1$.

  - The overall number of '$*$' and '$+_\varepsilon$' operators is bounded by $2w - 1$.

  - The size of $E$ is such that: $2w - 1 \le s \le 6w - 3$.

The Thompson ε-automaton and the ZPC-structure are known to be linear
structures w.r.t. $s$. If the input expression is a reduced one, the complexity
can be expressed w.r.t. $w$ and thus made more precise. Moreover, a detailed
implementation will be provided for each structure, in order to deduce an
exact measure of the space. Lastly, the number of elementary operations will be
bounded under the following assumption:

Hypothesis $H_1$: The number of elementary operations to construct a structure
(resp. to compute a $\delta(X, a)$ set) is proportional to the number of created
(resp. visited) edges.

3 The Position Automaton of a Regular Expression 

A regular expression is linear if and only if all its symbols are distinct. The
following sets of symbols are associated to a linear expression $E$:

- $Null(E) = \{\varepsilon\}$ if $\varepsilon \in L(E)$ and $\emptyset$ otherwise,

- the set $First(E)$ of symbols matching the first symbol of some word in $L(E)$,

- the set $Last(E)$ of symbols matching the last symbol of some word in $L(E)$,

- the sets $Follow(E, x)$ of symbols following $x$ in some word of $L(E)$, $\forall x \in \Sigma$.

Now, let $E$ be a regular expression over $\Sigma$. If $a$ is the $j$-th occurrence of
a symbol in $E$, $a_j$ is the position associated to $a$. The set of positions of $E$
is denoted by $Pos(E)$. The linear expression deduced from $E$ by substituting
each symbol by its position is denoted by $\bar{E}$. Let $h$ be the mapping from
$Pos(E)$ to $\Sigma$ such that $h(x)$ is the symbol related to the position $x$. The sets
of positions associated to $E$ are straightforwardly deduced from the sets of
symbols associated to $\bar{E}$: $Null(E) = Null(\bar{E})$, $First(E) = First(\bar{E})$,
$Last(E) = Last(\bar{E})$ and, $\forall x \in Pos(E)$, $Follow(E, x) = Follow(\bar{E}, x)$.

The position automaton $\mathcal{P}_E$ of $E$ is deduced from the sets $Null(E)$, $First(E)$,
$Last(E)$ and $Follow(E, x)$ as follows.

Definition 2. The position automaton of $E$, $\mathcal{P}_E = (Q, \Sigma, \delta, I, F)$, is defined by:

• $Q = Pos(E) \cup \{0\}$, where 0 is not in $Pos(E)$,

• $I = \{0\}$,

• $F = Last(E)$ if $Null(E) = \emptyset$, and $F = Last(E) \cup \{0\}$ otherwise,

• $\delta(0, a) = \{x \in First(E) \mid h(x) = a\}$, $\forall a \in \Sigma$,

• $\delta(x, a) = \{y \mid y \in Follow(E, x) \text{ and } h(y) = a\}$, $\forall x \in Pos(E)$, $\forall a \in \Sigma$.

Remark 1. (a) An expression $E$ such that $Null(E) = \{\varepsilon\}$ is said to be nullable.
(b) For every position $y$, the label of each transition going into $y$ is equal
to $h(y)$: the position automaton is said to be homogeneous.

The sets $Null(E)$, $First(E)$, $Last(E)$ and $Follow(E, x)$ can be computed
according to recursive formulas similar to the following ones, which hold for
the case $E = F \cdot G$:

\begin{align*}
Null(F \cdot G) &= Null(F) \cap Null(G) \\
First(F \cdot G) &= \begin{cases} First(F) \cup First(G) & \text{if } Null(F) = \{\varepsilon\} \\ First(F) & \text{otherwise} \end{cases} \\
Last(F \cdot G) &= \begin{cases} Last(F) \cup Last(G) & \text{if } Null(G) = \{\varepsilon\} \\ Last(G) & \text{otherwise} \end{cases} \\
Follow(F \cdot G, x) &= \begin{cases} Follow(F, x) \cup First(G) & \text{if } x \in Last(F) \\ Follow(F, x) \cup Follow(G, x) & \text{otherwise} \end{cases}
\end{align*}

The complexity of the position automaton is the following. The size of the
table is bounded by $w(w + 1)$. A direct computation of the $Follow(E, x)$ sets is
in $O(w^3)$ time, hence an $O(w^3)$ construction of the table. The computation of
$\delta(x, a)$ can be performed in $O(w)$ time, via three means: the conversion of the
expression into its star normal form, the lazy transition evaluation, or the
elimination of the redundant follow links in the ZPC-structure. It leads to an
$O(w^2)$ construction of the table. Moreover, the computation of $\delta(X, a)$, where
$X$ is an arbitrary set of states, is performed in $O(w^2)$ time using the table.
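
The table-based computation of $\delta(X, a)$ can be illustrated with a short sketch. It assumes that the sets First, Last and Follow have already been computed and are passed in as plain Python sets and dictionaries; the function names and the sample expression $(a + b)^* a b$ are illustrative choices, not the paper's implementation.

```python
def position_automaton(first, last, follow, nullable, h):
    """Build the table of the position automaton; state 0 is the initial state."""
    table = {}                                   # (state, symbol) -> set of positions
    for x in first:
        table.setdefault((0, h[x]), set()).add(x)
    for x, targets in follow.items():
        for y in targets:
            table.setdefault((x, h[y]), set()).add(y)
    finals = set(last) | ({0} if nullable else set())
    return table, finals

def delta_set(table, X, a):
    """delta(X, a): union of the (x, a)-entries of the table over x in X."""
    out = set()
    for x in X:
        out |= table.get((x, a), set())
    return out

def accepts(table, finals, word):
    X = {0}
    for a in word:
        X = delta_set(table, X, a)
    return bool(X & finals)

# (a + b)* a b, with positions a_1, b_2, a_3, b_4.
h = {1: "a", 2: "b", 3: "a", 4: "b"}
first, last = {1, 2, 3}, {4}
follow = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: set()}
table, finals = position_automaton(first, last, follow, False, h)
```

Each call to `delta_set` may visit up to w entries of up to w positions each, which is the quadratic cost mentioned above and the motivation for the shared structures studied next.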



4 The ZPC-Structure of a Regular Expression

The ZPC-structure has been described in earlier work. It is made of two copies
of the syntax tree and of a collection of links connecting their nodes and
implementing the computation of the First, Last and Follow sets of the
subexpressions of $E$.

4.1 Definition of the ZPC-Structure

Let $E_0 = \$ \cdot (E) \cdot \#$, where '\$' and '#' are not in $\Sigma$, and let
$Pos(E_0) = \{x_0, x_1, x_2, \ldots, x_w, x_{w+1}\}$, with $h(x_0) =$ '\$' and $h(x_{w+1}) =$ '#', be the
set of positions of $E_0$. The ZPC-structure of $E$ is defined by: $ZPC_E = zpc(E_0)$;
the inductive construction of $zpc(E)$ is illustrated by Fig. 1. Notice that the
definition we give here is slightly different from the original one. The links
used to compute the Last sets have been discarded, since our main goal is to
design an efficient pattern matcher. The graphical presentation has been
modified too, to facilitate the comparison with the Thompson ε-automaton.

Let us give some indications concerning this construction. The two copies of
the syntax tree of E_0 associated with the structure ZPC_E are denoted by Firsts(E)
and Lasts(E). If F is a subexpression of E_0, the corresponding node in Firsts(E)
(resp. Lasts(E)) is denoted by φ_F (resp. λ_F). As recalled below, the computation
of δ(X, a) involves a bottom-up traversal of Lasts(E) and a top-down one of
Firsts(E). Hence the following implementation of ZPC_E, illustrated by Fig. 3:
• an array lpositions of size w + 1, such that lpositions[k], k = 0 to w, is a
pointer to the leaf associated with the position x_k (notice it is not necessary
to store a pointer to the leaf
• for a node λ in Lasts(E):

- a pointer lparent to the parent node; if λ is the root of a tree in the forest
Lasts(E) then lparent(λ) is set to NULL,

- a pointer follow, from the son of a '*' node (resp. the left son of a '·' node)
in Lasts(E) to its copy (resp. the right son of its copy) in Firsts(E),

- a boolean notvisited initialized to true before any δ(X, a) computation;

• an array froots of size s, such that froots[i] is a pointer to the root of a tree
in the forest Firsts(E);

• for a node φ of Firsts(E):

- the pointers fleftson and frightson deduced from the syntax tree,

- the pointers fbegin, fend and fnext to compute First(φ),

- an integer ftree, which is the index of the tree the node belongs to;

• for each leaf associated with a position:

- the character symbol and the integer frank associated with the position.

4.2 Complexity of the Construction of ZPC_E

The forests Firsts(E) and Lasts(E) are generally drawn with distinct nodes.
They can however be implemented with a shared set of s + 2 nodes. Each node
is equipped with six pointers: fleftson, frightson, fbegin, fend, lparent and
follow, and three data: the booleans null and notvisited and the index ftree.
Moreover, each of the w + 1 leaves of Firsts(E) is equipped with two data: the
pointer fnext and the character symbol. Lastly, the size of the array lpositions
is equal to w + 1, and the size of the array froots, which is less than the number
of operators, is bounded by s if E is an arbitrary expression, and by w if E
is reduced. Hence the following proposition:

Proposition 2. Let e be the space taken up by the structure ZPC_E. If the
expression is an arbitrary one, we have e ~ 8s + 3w. If the expression is a reduced
one, we have e ~ 7s + 4w and 18w < e < 46w.

As far as the time is concerned, the computation of the boolean null must be
taken into account. Conversely, some pointers are only computed for a subset of
the set of nodes: the overall number of non-null fleftson and frightson pointers
added to the number of trees in the forest Firsts(E) is equal to s − 1, and
the pointers follow are only generated by the '·' and '*' nodes. According to the
Proposition n the overall number of '·' and '*' operators, and thus of non-null
follow pointers, is bounded by 3w − 2. Lastly, the computation of each pointer
is in O(1) time. Under the hypothesis Hi, we get the following proposition:

Proposition 3. Let t be the time taken up by the construction of the structure
ZPC_E. If the expression is an arbitrary one, we have t ~ 7s + 3w. If the
expression is a reduced one, we have t ~ 7s and 14w < t < 42w.






4.3 Computation of δ(X, a)

The computation of the δ(X, a) sets on a ZPC-structure is examined in m
0]. Since the position automaton is homogeneous, δ(X, a) can be deduced in
O(w) time from δ(X) = ∪_a δ(X, a), according to the formula: δ(X, a) = {y ∈
δ(X) | h(y) = a}. Moreover, the set Y = δ(X) can be computed by the following
algorithm:

Algorithm deltaZPC:

• Step 1: Compute the set Λ of nodes λ in Lasts(E) such that Last(λ) ∩ X ≠ ∅
and there exists a follow link exiting from node λ.

• Step 2: Compute the set Φ of nodes φ in Firsts(E) such that there exists
a follow link exiting from a node of Λ and entering φ. The set Y = δ(X) is such
that Y = ∪_{φ ∈ Φ} First(φ).

• Step 3: Deduce a set Φ' from Φ so that the set Y is computed as a disjoint
union, according to the formula: Y = ⊎_{φ ∈ Φ'} First(φ).
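The homogeneity formula δ(X, a) = {y ∈ δ(X) | h(y) = a} can be checked on a small table. The sketch below does not reproduce the traversal of Lasts(E) and Firsts(E) performed by deltaZPC; it simply uses explicit Follow sets, worked out by hand from the Glushkov rules for E_0 = $ · ((a · (a + b + ε))* · b)* · #, with position 0 standing for '$' and position 5 for '#'. The names `delta_all` and `delta` are hypothetical.

```python
# Follow sets of E_0, computed by hand for this sketch (positions 0 = $, 5 = #).
FOLLOW = {
    0: {1, 4, 5},
    1: {1, 2, 3, 4},
    2: {1, 4},
    3: {1, 4},
    4: {1, 4, 5},
    5: set(),
}
# h maps each position to its symbol.
H = {0: "$", 1: "a", 2: "a", 3: "b", 4: "b", 5: "#"}

def delta_all(follow, X):
    """delta(X): union of the Follow sets of the positions in X."""
    Y = set()
    for x in X:
        Y |= follow[x]
    return Y

def delta(follow, h, X, a):
    """delta(X, a): keep the positions of delta(X) labelled by a (homogeneity)."""
    return {y for y in delta_all(follow, X) if h[y] == a}
```

With X = {1, 4}, that is X = {a1, b4}, `delta_all` returns all of {1, 2, 3, 4, 5}, and filtering by 'a' keeps {1, 2}.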

Example 1. Let E = ((a · (a + b + ε))* · b)* and consider the structure ZPC_E of
Fig. 3. Let X = {a1, b4}. The following sets are computed:

Λ = {λ5, λ4, λ3, λ2, λ1}, Φ = {φ6, φ4, φ11, φ2, φ#}, and Y = {a1, a2, b3, b4, #}.

Let us examine the complexity of the Algorithm deltaZPC. Let L_X be the
set of the roots of the trees in Lasts(E) containing at least one node of Λ. For
all λ in L_X, λ_X is defined as the partial subtree rooted in λ and made of all the
paths going from λ to the leaves labelled by a position in X. Similarly, let F_X
be the set of the roots of the trees in Firsts(E) containing at least two nodes of
Φ. For all φ in F_X, φ_X is defined as the partial subtree rooted in φ and obtained
by deletion of the subtrees rooted in a node belonging to Φ. The number of edges
in λ_X (resp. φ_X) is denoted by |λ_X| (resp. |φ_X|). The size of the set Λ (resp. Φ)
is bounded by the number f_X of follow links.

Proposition 4. Under the hypothesis Hi, the computation time of the different
steps of the Algorithm deltaZPC is the following:

• Step 1: t_1 = 2 Σ_{λ ∈ L_X} |λ_X|,

• Step 2: t_2 = f_X,

• Step 3: Computation of Φ': t_3 = Σ_{φ ∈ F_X} |φ_X|. Computation of Y: t'_3 ≤ w +
2 f_X.



Remark 2. The coefficient 2 in t_1 is due to the necessary initialization of the
boolean notvisited before each δ(X, a) computation. In Step 3, if a Firsts(E)
tree contains exactly one node of Φ, this node is straightforwardly added to Φ'.
The computation of First(φ), for all φ in Φ', visits the two edges fbegin(φ) and
fend(φ). The size of Φ' is bounded by the size of Φ. Moreover the computation
of Y as a disjoint union implies visiting |Y| ≤ w + 1 fnext edges. Hence the
bound on the time t'_3.

5 The i-Automaton of a Regular Expression

In an ε-automaton, instantaneous transitions, i.e. transitions on the empty word
ε, are allowed. Thompson has designed in H3| the construction of an
ε-automaton recognizing the language of a given expression. We present here the
definition of the i-automaton of an expression, a variant of the Thompson
ε-automaton. The interest of the i-automaton is that its structure is closely
related to the syntax tree of the expression, which makes the comparison to the
ZPC-structure more visual.



5.1 Definition of the i-Automaton

The i-automaton (Q, Σ, i_E, t_E, δ_i) of the expression E is defined inductively by
the schemas of Fig. 2, where the unlabelled edges correspond to ε-transitions.
For each subexpression F of E, the inductive construction produces the two
states i_F and t_F. The set of states of the i-automaton is the union of the set of
states i_F and of the set of states t_F.



5.2 Properties of the i-Automaton

Let us first notice that the Thompson ε-automaton can be viewed as an
optimization of the i-automaton: in the case E = F · G, Thompson merges the
states i_E and i_F, as well as the states t_E and t_G. The properties we state for
i-automata in the following are well-known properties of Thompson ε-automata.
The relationship between the i-automaton and the position automaton is stated
by the following proposition, where the ε-closure of the state q is denoted by
ε(q). The proof is by induction on the size of E.

Proposition 5. Let Pos(E) = {x_1, x_2, ..., x_w} be the set of positions of E and
(Q, Pos(E), i_E, t_E, δ_i) be the i-automaton of E. Let I (resp. T) be the set of states
i_{x_k} (resp. t_{x_k}) generated by the expressions x_k. The following properties hold:

• Null(E) is equal to {ε} iff there exists an ε-path from i_E to t_E,

• First(E) = ε(i_E) ∩ I, Last(E) = ε^{-1}(t_E) ∩ T,

• Follow(E, x_k) = {x_l | i_{x_l} ∈ ε(t_{x_k})}, δ(x_k, a) = {x_l | t_{x_l} ∈ δ_i(ε(t_{x_k}) ∩ I, a)}.
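The role of the ε-closure in these properties can be illustrated by a generic sketch on a table of ε-edges and symbol edges. The dictionaries `eps` and `sym`, the function names and the state names are assumptions of this sketch; the small automaton used in the usage note is our reading of the product case (ε-edges from i_E to i_F, from t_F to i_G and from t_G to t_E), not a reproduction of Fig. 2.

```python
def eps_closure(eps, states):
    """States reachable from `states` through epsilon-edges, the states included."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for r in eps.get(q, ()):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen

def step(eps, sym, X, a):
    """Targets of the a-labelled symbol edges leaving the epsilon-closure of X."""
    return {sym[q][1] for q in eps_closure(eps, X) if q in sym and sym[q][0] == a}

# Assumed i-automaton of E = a . b: one (i, t) pair per subexpression.
EPS = {"iE": ["ia"], "ta": ["ib"], "tb": ["tE"]}
SYM = {"ia": ("a", "ta"), "ib": ("b", "tb")}
```

Here the closure of i_E meets the set I of states i_{x_k} in {ia}, which corresponds to First(a · b) = {a}; each state has at most two outgoing edges, so one call to `step` visits a number of edges linear in the number of states.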

The complexity of an i-automaton is as follows. An i-automaton can be
represented by a table with 2s entries. Each state (except t_E) is the origin either
of one symbol-transition, or of one or two ε-transitions. Moreover, the
computation of each transition is in O(1) time. Therefore, the space and time taken
up by the construction of the table are equal to 4s. More precisely, according to
the Proposition nj and under the hypothesis Hi, it can be proved that the time
is bounded by 18w when the expression is reduced. Moreover, due to the fact
that the number of edges going out of a state is bounded by 2, the computation
of the set δ(X, a) is performed in O(s) time.

6 ZPC-Structure vs. i-Automaton

We now use the fact that the i-automaton is deduced from the syntax tree of
the expression to show similarities and differences with the ZPC-structure.



6.1 The Structure of the i-Automaton

Let I_E and T_E be two copies of the syntax tree of E. Let X_I (resp. X_T) be the
set of nodes of I_E (resp. T_E). Let G_E = (U, V, Σ ∪ {ε}) be a labelled digraph
such that U = X_I ∪ X_T and V is deduced from the set Γ_I of the edges ileftson
and irightson of I_E, the set Γ_T of the edges tparent of T_E and the following
sets:

- E1: the set of the edges irightson in I_E with a node '·' as origin,

- E2: the set of the edges tparent in T_E associated with the tleftson edge of a
node '·',

- E3: the set of the edges from the son of a node '*' in T_E to its copy in I_E,

- E4: the set of the edges from the left son of a node '·' in T_E to the right son
of its copy in I_E,

- E5: the set of the edges from a node '*' in I_E to its copy in T_E,

- E6: the set of the edges from a leaf (symbol, empty word) in I_E to its copy
in T_E.

We have: V = (Γ_I \ E1) ∪ (Γ_T \ E2) ∪ (E3 ∪ E4 ∪ E5 ∪ E6). All the edges are
labelled by ε except the symbol-edges of E6. The set V can be computed through
a traversal of the syntax tree. The following proposition is directly deduced from
the inductive definition of the i-automaton.

Proposition 6. The graph G_E is isomorphic to the state graph of the
i-automaton of E.

The comparison between G_E and ZPC_E yields the following similarities:

• Identical sets of nodes (as far as ZPC_E is defined on two distinct copies).

• Identical sets of edges connecting the second copy to the first one: the set
E3 ∪ E4 in G_E is equivalent to the set of follow links in ZPC_E.

• Closely related sets of syntactic edges:

- the sets of ileftson edges and of fleftson edges are equivalent,

- the set of tparent edges is equivalent to the set of lparent edges minus the
edges (λ_F, λ_E) in the case E = F · G ∧ Null(G) = {ε},

- the set of irightson edges is equivalent to the set of frightson edges minus
the edges (φ_E, φ_G) in the case E = F · G ∧ Null(F) = {ε}.

On the other hand, the main differences are the following:

• The handling of the positions: in G_E, each leaf of I_E associated with a position
is connected to the corresponding leaf of T_E by an edge labelled by the
associated symbol. In ZPC_E, the array lpositions provides the addresses of the leaves
associated with the positions (which are shared by Firsts(E) and Lasts(E)).

• The processing of the nullable subexpressions: in G_E, a subexpression F is
nullable iff there exists at least one ε-path from i_F to t_F. The existence of such
a path is related to the ε-edges connecting ε-leaves in I_E to the corresponding
leaves in T_E, and to the edges of the set E5 coming from the processing
of the starred expressions. In ZPC_E, the boolean null(F) is computed for each
subexpression F.

• The computation of the set First(F), for F a non-nullable subexpression of
E: in G_E, we have: First(F) = ε(i_F) ∩ I; the computation of First(F) implies
the exploration of the subtree of I_E rooted in i_F. In ZPC_E, First(F) is directly
deduced from the pointers fbegin and fend associated with F and from fnext links.



6.2 Complexity of the Construction of G_E and ZPC_E

The two constructions are in 0(s) space and time. The following table makes 
this result more precise, by giving the space taken by each structure, and the 
construction time under the hypothesis Hi. The input is either an arbitrary 
expression, or a reduced one. Identical assumptions have been made to evaluate 
the complexity of the two constructions (cf. Propositions and . 





         arbitrary expression        reduced expression
         G_E        ZPC_E            G_E               ZPC_E
space    4s + 2w    8s + 3w          10w < e < 26w     18w < e < 46w
time     4s + 2w    7s + 3w          6w < t < 17w      14w < t < 42w



The construction of ZPC_E is about twice as expensive as the construction
of G_E. The question is whether the additional information it
computes allows a fairly large speedup of the computation of the δ(X, a) sets.



6.3 Computation of δ(X, a)

The comparison of the operations performed to compute the sets δ(X, a) in G_E
and in ZPC_E is made possible by the following proposition:

Proposition 7. Let p and p' be two nodes in the same forest of ZPC_E and let
r and r' be the corresponding nodes in G_E. The following property holds: there
exists a path from p to p' iff there exists an ε-path from r to r'.

Proof. Let us first verify that if there does not exist a link tparent(t_G), t_G ≠ t_E,
then we have: there exists a link lparent(λ_G) = λ_F iff there exists an ε-path
from t_G to t_F. There does not exist a link tparent(t_G), t_G ≠ t_E, iff F = G · H. In
this case, there exists a link lparent(λ_G) iff Null(H) = {ε}. On the other hand,
there exists an ε-path from i_H to t_H iff Null(H) = {ε}. Moreover, there exists
a follow link from t_G to i_H. Finally, we get: there exists a link lparent(λ_G) iff
there exists an ε-path from t_G to t_F. It can be proved in a similar way that
if there does not exist a link irightson(i_F) then we have: there exists a link
frightson(φ_F) = φ_H iff there exists an ε-path from i_F to i_H. □

Example 2. Figures 3 and 4, which respectively represent the ZPC-structure
and the G-structure of the expression E = ((a · (a + b + ε))* · b)*, show that:

“ The e-path (ts, zg, is, iiO) fiO; ^6; ^ 4 ) acts as the Iparent link (A 5 ,A 4 ). 

- The e-path (z 2 , * 3 , ts, * 11 ) acts as the frightson link 

- There exists no e-path from to t 2 and no Iparent link from A 3 to A 2 . 

- There exists no e-path from 14 to zg and no frightson link from (^4 to (/?g. 

Finally, the computation of δ(X, a) in G_E amounts to computing sets which
are equivalent to the Λ, Φ and Y sets, as in the Algorithm deltaZPC, and with the
following complexity:

• Step 1: The number of visited edges is greater in T_E than in Lasts(E) due to
the ε-paths which act as lparent or frightson links.

• Step 3: (a) Since the boolean notvisited is used in I_E, each edge must be
visited a second time due to a necessary re-initialization step.

(b) If a tree in Firsts(E) contains only one node φ of Φ, then First(φ) is
obtained directly from the pointers fbegin(φ) and fend(φ) and from the fnext
linking. By contrast, the corresponding subtree in I_E is explored.

(c) If a tree in Firsts(E) contains at least two nodes of Φ, the subsets of edges
of this tree respectively visited in each structure are complementary.

This comparison can be summarized as follows:

Proposition 8. The average computation time of δ(X, a) is smaller when
performed on the structure ZPC_E than on the structure G_E; the ratio between the
average number of edges visited in Firsts(E) and in I_E is equal to 2/3.



7 The Forest-Structure of an Expression

From now on, we assume that E is a reduced expression and that f is the
number of follow links. Figure 5 illustrates the construction of F_E, the
forest-structure of E = ((a · (a + b + ε))* · b)*.

The structure F_E is deduced from ZPC_E by the following optimizations:

• The handling of empty word occurrences: Since E is a reduced expression,
we consider the unary operator '°' such that: Null(F°) = {ε}, First(F°) =
First(F), Last(F°) = Last(F) and, ∀x ∈ Pos(F), Follow(F°, x) =
Follow(F, x), and F° is substituted for each occurrence of F + ε and ε + F.
Since there are no longer leaves associated with the empty word, the pointers
fbegin and fend are necessarily non-null, hence a faster computation. Moreover
the size of a reduced expression is now bounded by 4w − 2.

• The compression of the forests Firsts(E) and Lasts(E): The forest Lasts(E)
(resp. Firsts(E)) is compressed so that only the heads (resp. the tails) of follow
links are kept. The following properties hold:

- the compressed forests do not necessarily share the same set of nodes,

- the number of sons of a node may be more than 2,

- in each forest, the number of nodes is equal to f and the number of leaves
is bounded by w + 1.

Proposition 9. Let e (resp. t) be the space (resp. time) taken up by the
construction of the F_E structure. We have: e and t are approximately equal to
7f + 3w and 3w < e, t < 24w.

Proposition 9 holds under the same assumptions as for the analysis of the
structures G_E and ZPC_E. Moreover, the Algorithm deltaZPC can be readily
adapted to an F_E structure. Due to the reduction of the size of the forests,
the computation of a δ(X, a) set is faster on F_E than on ZPC_E.

8 Conclusion and Perspectives

The comparative analysis of the structures G_E, ZPC_E and F_E has been
conceived as preliminary work for their implementation and for an experimental
study of their performance. For the construction step, the main results are the
following:





         reduced expression
         G_E               ZPC_E             F_E
space    10w < e < 26w     18w < e < 46w     3w < e < 24w
time     6w < t < 17w      14w < t < 42w     3w < t < 24w



Concerning the computation of the δ(X, a) sets, we have shown that F_E is
naturally more efficient than ZPC_E, which is more efficient on average than G_E.

ANNEX A: Figures

Fig. 1. The inductive definition of the structure zpc(E). [Figure not reproduced.
Legend: syntax links, follow links, fnext links, fbegin and fend links, nullable
expression.]

Fig. 2. The inductive definition of the i-automaton of a regular expression.
[Figure not reproduced.]

Fig. 3. The structure ZPC_E of E = ((a · (a + b + ε))* · b)*. [Figure not reproduced.
Legend: syntax links, missing syntax links, fnext links, follow links, follow link
from $, follow link to #, fbegin and fend links, lpositions array, froots array,
nullable expression.]

Fig. 4. The i-automaton of E = ((a · (a + b + ε))* · b)*. [Figure not reproduced.
Legend: syntax links, missing syntax links, links added by the construction (the
empty word is the default label).]

Fig. 5. The structure F_E of E = ((a · (a + b + ε))* · b)*.

Abstract. Two classical constructions to convert a regular expression
into a finite non-deterministic automaton provide complementary
advantages: the notion of position of a symbol in an expression, introduced
by Glushkov and McNaughton-Yamada, leads to an efficient computation
of the position automaton (there exist quadratic space and time
implementations w.r.t. the size of the expression), whereas the notion of
derivative of an expression w.r.t. a word, due to Brzozowski, and
generalized by Antimirov, yields a small automaton. The number of states
of this automaton, called the equation automaton, is less than or equal
to the number of states of the position automaton, and in practice it is
generally much smaller. So far, algorithms to build the equation automaton,
such as Mirkin's or Antimirov's, have a high space and time
complexity. The aim of this paper is to present new theoretical results that
make it possible to compute the equation automaton in quadratic space and time,
improving Antimirov's construction by a cubic factor. These results rely
on the computation of a new kind of derivative, called canonical derivative,
which makes it possible to connect the notion of continuation in
a linear expression due to Berry and Sethi, and the notion of partial
derivative of a regular expression due to Antimirov. A main interest of
the notion of canonical derivative is that it leads to an efficient computation
of the equation automaton via a specific reduction of the position
automaton.



1 Introduction 



Converting a regular expression into a finite automaton is a basic operation
implemented inside many standard software tools such as compilers, context
search requests, pattern matching engines, document processing packages and
protocol simulators. Watson recently published a taxonomy on this topic im,
which has given rise to numerous theoretical and algorithmic developments j1 2lhlfil1 ™
II 6I,'tISI 4I‘2I1 H1 1 j over the past half century.

Two notions turn out to be very suitable for carrying out the conversion of a
regular expression into a non-deterministic automaton. The notion of position of
a symbol in an expression, used in Glushkov |2| and McNaughton-Yamada m
algorithms, leads to the computation of the classical position automaton of the
expression. The notion of partial derivative of an expression w.r.t. a word,
introduced by Antimirov [2], generalizes the notion of word derivative due to
Brzozowski 0, and leads to the computation of the equation automaton of the
expression.

For a given expression, with w occurrences of symbols, the number of states
is equal to w + 1 in the position automaton, and is less than or equal to w + 1
in the equation automaton. Furthermore, in current applications, such as
lexical analysis, the equation automaton can be much smaller than the position
automaton. On the other hand, there exist several efficient implementations of
the position automaton: the algorithms described in !4IHIU)I1,^| have an O(s^2)
space and time complexity, where s is the size of the expression; by contrast,
there exist only two algorithms to compute the equation automaton, due
to Mirkin and to Antimirov [2], and they have a high space and time
complexity: O(s^3) space and O(s^5) time for Antimirov's construction. The challenge
is thus to design an algorithm to compute equation automata with the same
efficiency as for computing position automata.

The notion of canonical derivative (or c-derivative) we introduce here allows
one to study thoroughly the relation between the position automaton and the
equation automaton and therefore leads to such an algorithm. The connection
between the position automaton and the notion of word derivative has been
studied by Berry and Sethi jS|, who showed that a set of similar word derivatives
is associated with each state of the position automaton (two similar derivatives
can be deduced from each other by using associativity, commutativity and idempotence
properties of the + operation). Canonical derivatives lead to a refinement of this
property: we show that if two word derivatives of a linear expression (an
expression where symbols are distinct) are similar, then the corresponding c-derivatives
are identical, and we deduce that a unique expression, called a c-continuation, is
associated with each state of the position automaton. Hence the definition of the
c-continuation automaton, which contains the position automaton.

On the other hand, c-derivatives are connected to Antimirov's partial
derivatives [2] in the following way: consider the set of c-continuations in the linearized
version Ē of a regular expression E, and assume two c-continuations are
equivalent if and only if they are linearizations of the same expression. Then the set
of c-continuations in Ē is equal, modulo this equivalence, to the set of partial
derivatives of E. Hence the computation of the equation automaton by reduction
of the c-continuation automaton. Let us point out that these theoretical results
are not related to the work of Hromkovic et al. mi: firstly, the common follow
sets automaton they define has many more states than the position automaton,
and secondly they do not use any derivative-like tool; by contrast, our results
are based on a new algebraic tool: the c-derivatives.

These theoretical results have interesting algorithmic applications. It is shown
in [7] that an implicit computation of the set of c-continuations and of the
quotient set can be achieved with an O(s^2) space and time complexity, which
leads to a new quadratic implementation of both the position and the equation
automata and significantly improves the O(s^5) complexity of Antimirov's
algorithm [2]. Notice that the techniques we use to handle c-continuations are
necessarily different from the procedures used in UDI to implement the common
follow sets automaton. On the other hand, some refinements to compute the set
of transitions rely on the properties of the implicit structure imT^ we designed
in the past to represent the position automaton; a very closely related structure
is used in m-
Our investigations are illustrated by the following diagram. Our contribution 
is represented in bold font. 



[Diagram not reproduced. Surviving labels: E (|E| = s); Partial derivatives [2].]



Section 2 recalls classical notions of automata theory. Section 3 summarizes 
theoretical results concerning c-derivatives and their relation with word deriva- 
tives. Relation with partial derivatives is examined in Sect. 4. The c-continuation 
automaton is introduced in Sect. 5, and the way it is connected to the position 
automaton and to the equation automaton is examined. 

2 Preliminaries 

We first recall terminology and basic results concerning regular languages and 
expressions, and finite automata. For further details about these topics, we refer 
to classical books m. 

Let Σ be a non-empty finite set of symbols, called the alphabet, and Σ* be
the set of all the words over Σ. Let ε be the empty word. A language over Σ
is a subset of Σ*. A regular expression over the alphabet Σ is 0, or 1, or a
symbol x_j ∈ Σ, or is obtained by recursively applying the following rules: if F
and G are two regular expressions, the union (F + G) of F and G, the product
(F · G) or FG of F and G, and the star (F*) of F are regular expressions.
The regular language L(E) denoted by a regular expression E is such that:
L(0) = ∅, L(1) = {ε}, L(x_j) = {x_j} ∀x_j ∈ Σ, L(F + G) = L(F) ∪ L(G),
L(F · G) = L(F)L(G) and L(F*) = L(F)*. Following [2], let T be the algebra
of terms over the set Σ ∪ {0, 1}, with the function symbols *, +, ·, where * is
unary and + and · are binary. Syntactical properties of the constants 0 and 1,
and of the operators *, + and · lead to identification rules on the set of terms.
In the following, we shall consider that the classical rules involving 0 and 1 hold:
0 + E = E = E + 0, 1 · E = E = E · 1, 0 · E = 0 = E · 0. Associativity,
commutativity and idempotence properties of the + operation, called aci-rules,
are captured by the notion of similarity (^: two regular expressions F and G
are aci-similar (F ∼_aci G) if and only if they reduce to the same expression by
applying aci-rules.

We define λ(E) = 1 if ε ∈ L(E), 0 otherwise. A regular expression E such that
λ(E) = 1 is said to be nullable. We write E ≡ F when the expressions E and
F are identical. The size of E is denoted by |E|. The alphabetic width of E, i.e. the
number of symbol occurrences in E, is denoted by ||E||.

A regular expression E is linear over Σ if and only if every symbol of Σ occurs
at most once in E. If x is the j-th occurrence of a symbol in E, the pair (x, j)
is called a position of E, written x_j. The set of positions of E is denoted by
Pos(E). The expression Ē over Pos(E) deduced from E by replacing symbol x
in place j by x_j, for all j in [1, ||E||], is called the linearized version of E. Ē
is linear over Pos(E). We denote by h the mapping from Pos(E) to Σ such that
h(x_j) = x, ∀j ∈ [1, ||E||], and h(Ē) = E. Let E = a · (a + b) + (a + b) · (1 + b).
We have: Pos(E) = {a1, a2, b3, a4, b5, b6}, Ē = a1 · (a2 + b3) + (a4 + b5) · (1 + b6),
h(a1) = h(a2) = h(a4) = a and h(b3) = h(b5) = h(b6) = b.
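The linearization and the mapping h can be illustrated by a short Python sketch. The tuple encoding of expressions and the function names are assumptions made for this illustration; a position x_j is represented by the pair (x, j).

```python
from itertools import count

# Hypothetical encoding: ("0",), ("1",), ("sym", x), ("alt", F, G),
# ("cat", F, G), ("star", F).

def linearize(e, counter=None):
    """Replace the j-th symbol occurrence x by the position (x, j)."""
    counter = counter if counter is not None else count(1)
    if e[0] == "sym":
        return ("sym", (e[1], next(counter)))
    if e[0] in ("0", "1"):
        return e
    return (e[0],) + tuple(linearize(sub, counter) for sub in e[1:])

def positions(e):
    """Symbols (or positions) of e, read from left to right."""
    if e[0] == "sym":
        return [e[1]]
    return [p for sub in e[1:] for p in positions(sub)]

def h(e):
    """Erase the indices of a linearized expression."""
    if e[0] == "sym":
        return ("sym", e[1][0])
    return (e[0],) + tuple(h(sub) for sub in e[1:])
```

Applied to E = a · (a + b) + (a + b) · (1 + b), `positions(linearize(E))` lists the six positions a1, a2, b3, a4, b5, b6 in that order, and `h` maps the linearized expression back to E.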

A finite automaton over Σ is a 5-tuple M = (Q, Σ, I, T, δ) where Q is the set of
states, I is the subset of initial states, T is the subset of final states, and δ is the
transition function. M is deterministic (M is a DFA) if |I| = 1 and if δ maps
Q × Σ to Q. Otherwise M is an NFA and δ maps Q × Σ to 2^Q.

3 Canonical Derivatives 

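Let E be a regular expression, a be a symbol in Σ and u = u_1 ... u_n be a word in
Σ*. The word derivative u^{-1}E of E w.r.t. u is recursively defined as follows jn|:

\[
\begin{aligned}
a^{-1}0 &= a^{-1}1 = 0 \\
a^{-1}x &= 1 \text{ if } a = x,\ 0 \text{ otherwise} \\
a^{-1}(F + G) &= a^{-1}F + a^{-1}G \\
a^{-1}(F \cdot G) &= a^{-1}F \cdot G \text{ if } \lambda(F) = 0,\ a^{-1}F \cdot G + a^{-1}G \text{ otherwise} \\
a^{-1}(F^*) &= a^{-1}F \cdot F^* \\
\varepsilon^{-1}E &= E \quad\text{and}\quad (u_1 \ldots u_n)^{-1}E = (u_2 \ldots u_n)^{-1}(u_1^{-1}E)
\end{aligned}
\]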
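The recursive definition translates almost line by line into code. The following Python sketch is an illustration under assumptions: the tuple encoding and the helper names `alt`, `cat`, `nullable`, `deriv` and `word_deriv` are ours, and the constructors `alt` and `cat` apply the classical rules involving 0 and 1 so that results stay readable. No aci-normalization is performed.

```python
ZERO, ONE = ("0",), ("1",)

def alt(f, g):
    """F + G with the rule 0 + E = E = E + 0."""
    if f == ZERO:
        return g
    if g == ZERO:
        return f
    return ("alt", f, g)

def cat(f, g):
    """F . G with the rules 0 . E = 0 = E . 0 and 1 . E = E = E . 1."""
    if f == ZERO or g == ZERO:
        return ZERO
    if f == ONE:
        return g
    if g == ONE:
        return f
    return ("cat", f, g)

def nullable(e):
    """lambda(E) = 1 iff the empty word belongs to L(E)."""
    kind = e[0]
    if kind in ("1", "star"):
        return True
    if kind in ("0", "sym"):
        return False
    if kind == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # cat

def deriv(a, e):
    """Word derivative of e w.r.t. the single symbol a."""
    kind = e[0]
    if kind in ("0", "1"):
        return ZERO
    if kind == "sym":
        return ONE if e[1] == a else ZERO
    if kind == "alt":
        return alt(deriv(a, e[1]), deriv(a, e[2]))
    if kind == "cat":
        left = cat(deriv(a, e[1]), e[2])
        return alt(left, deriv(a, e[2])) if nullable(e[1]) else left
    return cat(deriv(a, e[1]), e)                 # star

def word_deriv(u, e):
    """Derive by the symbols of u from left to right."""
    for a in u:
        e = deriv(a, e)
    return e
```

For E = ab + a*, `deriv("a", E)` returns the encoding of b + a*, and `word_deriv("ab", E)` returns 1.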

We introduce the notion of canonical derivative of a regular expression. The
aim is to make it easy to compute derivatives of a linear expression. Let us
recall that we assume that the classical rules involving 0 and 1 hold:
0 + E = E = E + 0, 1 · E = E = E · 1, 0 · E = 0 = E · 0.

Definition 1 (c-derivative). The c-derivative d_u(E) of E w.r.t. u is
recursively defined as follows:

\[
\begin{aligned}
d_a(0) &= d_a(1) = 0 \\
d_a(x) &= 1 \text{ if } a = x,\ 0 \text{ otherwise} \\
d_a(F + G) &= d_a(F) \text{ if } d_a(F) \neq 0,\ d_a(G) \text{ otherwise} \\
d_a(F \cdot G) &= d_a(F) \cdot G \text{ if } d_a(F) \neq 0,\ \lambda(F) \cdot d_a(G) \text{ otherwise} \\
d_a(F^*) &= d_a(F) \cdot F^* \\
d_\varepsilon(E) &= E \quad\text{and}\quad d_{u_1 \ldots u_n}(E) = d_{u_2 \ldots u_n}(d_{u_1}(E))
\end{aligned}
\]
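Definition 1 can be sketched in the same way. The Python code below is illustrative only: the tuple encoding, the helper names and the representation of positions as strings such as "a1" are assumptions of this sketch. The product λ(F) · d_a(G) is evaluated with the rules 1 · E = E and 0 · E = 0.

```python
ZERO, ONE = ("0",), ("1",)

def cat(f, g):
    """F . G with the rules 0 . E = 0 = E . 0 and 1 . E = E = E . 1."""
    if f == ZERO or g == ZERO:
        return ZERO
    if f == ONE:
        return g
    if g == ONE:
        return f
    return ("cat", f, g)

def nullable(e):
    kind = e[0]
    if kind in ("1", "star"):
        return True
    if kind in ("0", "sym"):
        return False
    if kind == "alt":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])      # cat

def cderiv(a, e):
    """c-derivative of e w.r.t. the single symbol a (Definition 1)."""
    kind = e[0]
    if kind in ("0", "1"):
        return ZERO
    if kind == "sym":
        return ONE if e[1] == a else ZERO
    if kind == "alt":
        d = cderiv(a, e[1])
        return d if d != ZERO else cderiv(a, e[2])
    if kind == "cat":
        d = cderiv(a, e[1])
        if d != ZERO:
            return cat(d, e[2])
        return cderiv(a, e[2]) if nullable(e[1]) else ZERO   # lambda(F) . d_a(G)
    return cat(cderiv(a, e[1]), e)                # star

def cderiv_word(u, e):
    """c-derivative w.r.t. a word given as a sequence of symbols."""
    for a in u:
        e = cderiv(a, e)
    return e
```

On the linear expression (a1 · b2 + b3)*, the c-derivative w.r.t. a1 is b2 · (a1 · b2 + b3)*, and the c-derivative w.r.t. the word a1 b2 is the expression itself.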

The following property is fundamental for handling c-derivatives of a linear
expression. It is deduced from Definition 1 by induction on the length of u and on
the structure of E.

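Proposition 1. The c-derivative d_u(E) of a linear expression E w.r.t. a word
u of Σ+ is either 0 or such that:

\[
\begin{aligned}
d_u(u) &= 1 && (1) \\
d_u(F + G) &= d_u(F) \text{ if } d_u(F) \neq 0,\ d_u(G) \text{ otherwise} && (2) \\
d_u(F \cdot G) &=
  \begin{cases}
    d_u(F) \cdot G & \text{if } d_u(F) \neq 0 \\
    d_s(G) & \text{otherwise } (s \neq \varepsilon \text{ is some suffix of } u)
  \end{cases} && (3) \\
d_u(F^*) &= d_s(F) \cdot F^* \quad (s \neq \varepsilon \text{ is some suffix of } u) && (4)
\end{aligned}
\]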

Proposition 2 states the properties of c-derivatives of a linear expression, and
their relations to word derivatives.



Proposition 2. Let E be a linear expression, a be any symbol in Σ, u and v be
any words in Σ*. The following properties hold:

(a) A non-null c-derivative of E is either 1 or a subexpression of E or a product
of subexpressions.

(b) d_a(E) = a^{-1}E

(c) d_u(E) ∼_aci u^{-1}E

(d) u^{-1}E ∼_aci v^{-1}E ⇒ d_u(E) = d_v(E)

Berry and Sethi jS] have proved that for any symbol a of a linear
expression E, the non-null derivatives (ua)^{-1}E, for all u in Σ*, are aci-similar. From
Proposition 2 (d) we deduce the following theorem:

Theorem 1. The set of non-null c-derivatives d_{ua}(E) of a linear expression E
reduces to a unique expression, the c-continuation of a in E, denoted c_a(E).



4 Canonical Derivatives and Partial Derivatives 

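Given a regular expression E and a symbol a, the set ∂_a(E) of partial derivatives
of E w.r.t. a is recursively defined on the structure of E as follows [2]:

\[
\begin{aligned}
\partial_a(0) &= \partial_a(1) = \emptyset \\
\partial_a(x) &= \{1\} \text{ if } a = x,\ \emptyset \text{ otherwise} \\
\partial_a(F + G) &= \partial_a(F) \cup \partial_a(G) \\
\partial_a(F \cdot G) &=
  \begin{cases}
    \emptyset & \text{if } G = 0 \\
    \partial_a(F) \cdot G & \text{if } G \neq 0 \text{ and } \lambda(F) = 0 \\
    \partial_a(F) \cdot G \cup \partial_a(G) & \text{otherwise}
  \end{cases} \\
\partial_a(F^*) &= \partial_a(F) \cdot F^*
\end{aligned}
\]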

Any derivative of $E$ can be represented by a finite set of some partial derivatives
of $E$. For example, the two partial derivatives of $ab + a^*$ w.r.t. $a$ are $b$ and
$a^*$, and $a^{-1}(ab + a^*) = b + a^*$. Partial derivatives extend to words according to
the formulas $\partial_\varepsilon(E) = \{E\}$ and $\partial_{ua}(E) = \partial_a(\partial_u(E))$.
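The recursive definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of expressions, the helper names (`sym`, `alt`, `cat`, `star`, `pd`, `pd_word`) and the simplification $1 \cdot G = G$ applied when building products are choices made here, not notation from the paper.

```python
# Hypothetical encoding: ("0",), ("1",), ("sym", a), ("+", F, G), (".", F, G), ("*", F)
ZERO, ONE = ("0",), ("1",)

def sym(a): return ("sym", a)
def alt(f, g): return ("+", f, g)
def star(f): return ("*", f)

def cat(f, g):
    """Product with the trivial simplifications 0.G = F.0 = 0, 1.G = G, F.1 = F."""
    if f == ZERO or g == ZERO:
        return ZERO
    if f == ONE:
        return g
    if g == ONE:
        return f
    return (".", f, g)

def nullable(e):
    """lambda(E): True iff the empty word belongs to L(E)."""
    tag = e[0]
    if tag in ("1", "*"):
        return True
    if tag in ("0", "sym"):
        return False
    if tag == "+":
        return nullable(e[1]) or nullable(e[2])
    return nullable(e[1]) and nullable(e[2])  # product

def pd(a, e):
    """Set of partial derivatives of e w.r.t. the symbol a."""
    tag = e[0]
    if tag in ("0", "1"):
        return set()
    if tag == "sym":
        return {ONE} if e[1] == a else set()
    if tag == "+":
        return pd(a, e[1]) | pd(a, e[2])
    if tag == ".":
        f, g = e[1], e[2]
        if g == ZERO:
            return set()
        left = {cat(d, g) for d in pd(a, f)}
        return left | pd(a, g) if nullable(f) else left
    return {cat(d, e) for d in pd(a, e[1])}  # star: pd(a, F) . F*

def pd_word(u, e):
    """Extension to words: start from {e} and derive symbol by symbol."""
    result = {e}
    for a in u:
        result = set().union(*(pd(a, d) for d in result))
    return result

e = alt(cat(sym("a"), sym("b")), star(sym("a")))       # ab + a*
print(pd("a", e) == {sym("b"), star(sym("a"))})        # True
```

On the expression $ab + a^*$ the sketch returns the two partial derivatives $b$ and $a^*$ quoted in the text.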

We now examine the relation between c-derivatives and partial derivatives.
Let $E$ be a regular expression over $\Sigma$, $\overline{E}$ be the linearized version of $E$ over
$\mathrm{Pos}(E)$ and $h$ be the mapping it induces from $\mathrm{Pos}(E)$ onto $\Sigma$. Let $C_a$ be the
set of the images by $h$ of the non-null c-derivatives of $\overline{E}$ w.r.t. positions $a_i$ such
that $h(a_i) = a$. Proposition 3 says that $C_a$ is the set of partial derivatives of $E$
w.r.t. $a$, and Proposition 4 is used to extend this property to words.

Proposition 3. Let $E$ be a regular expression. Let $H_E$ be a subexpression of $E$.
Denote by $P(H_E)$ the property: $a_i \in \mathrm{Pos}_E(H_E) \wedge h(a_i) = a \wedge d_{a_i}(\overline{H_E}) \neq 0$.
Then for all $H_E$ one has:

$$\bigcup_{P(H_E)} h(d_{a_i}(\overline{H_E})) = \partial_a(H_E)$$

Let $E$ be a regular expression and $F = d_u(\overline{E})$ be a non-null c-derivative of $\overline{E}$.
We consider the linearization $\overline{h(F)}$ of $h(F)$ and denote by $h'$ the mapping from
$\mathrm{Pos}(h(F))$ onto $\Sigma_E$ it induces. For example, if $E = (ab + b)^*$ and $F = d_{a_1}(\overline{E}) =
b_2(a_1b_2 + b_3)^*$, we have: $h(F) = b(ab + b)^*$ and $\overline{h(F)} = b_1(a_2b_3 + b_4)^*$.

Since $F = H_1 \cdot H_2 \cdots H_l$, where $H_i$ is a linear subexpression of $\overline{E}$, for $1 \le i \le l$,
we have $\overline{h(F)} = H'_1 \cdot H'_2 \cdots H'_l$, and $H_i$ and $H'_i$ are two linearizations of the
same expression. In our current example, we have: $H_2 = (a_1b_2 + b_3)^*$ and $H'_2 =
(a_2b_3 + b_4)^*$.

Let $a_j$ be a symbol of $\overline{h(F)}$, and $\mu(a_j)$ be the corresponding symbol of $F$. Notice
that two distinct symbols of $\overline{h(F)}$ may be mapped to the same symbol of $F$:
for instance, $\mu(b_1) = \mu(b_3) = b_2$. Proposition 4 shows that it is equivalent to
compute the c-derivative of $F$ w.r.t. $a_i$, and the c-derivative of $\overline{h(F)}$ w.r.t. any
of the $a_j$ such that $\mu(a_j) = a_i$.

We first state a lemma which is useful for the proof of this proposition. This 
lemma deduces from Proposition^ 

Lemma 1. Let $d_u(\overline{E}) = H_1 \cdot H_2 \cdots H_l$ be a non-null c-derivative. For all $i$ and
$j$ such that $1 \le i < j \le l$, if $\mathrm{Pos}_E(H_i) \cap \mathrm{Pos}_E(H_j) \neq \emptyset$, then there exists $s$,
a suffix of $u$, such that $d_u(\overline{E}) = d_s(H_j) \cdot H_{j+1} \cdots H_l$.



Proposition 4. Let $E$ be a regular expression, $F = d_u(\overline{E})$ be a non-null
c-derivative of $\overline{E}$, and $h'$ be the mapping associated to $\overline{h(F)}$. Let $a_i$ be a position
of $E$, and $a_j$ be a position of $\overline{h(F)}$ such that: $\mu(a_j) = a_i$, $h'(a_j) = h(a_i)$ and
$d_{a_j}(\overline{h(F)}) \neq 0$. Then there exists $m$, $1 \le m \le l$, such that:

$$d_{a_j}(\overline{h(F)}) = d_{a_j}(H'_m) \cdot H'_{m+1} \cdots H'_l$$

$$d_{a_i}(F) = d_{a_i}(H_m) \cdot H_{m+1} \cdots H_l$$

$$h'(d_{a_j}(\overline{h(F)})) = h(d_{a_i}(F))$$



Theorem 2. Let $E$ be a regular expression. Let $H_E$ be a subexpression of $E$.
$P(u, H_E)$ denotes the property: $v \in \mathrm{Pos}_E(H_E)^* \wedge h(v) = u \wedge d_v(\overline{H_E}) \neq 0$.
Then for all subexpressions $H_E$ of $E$, one has:

$$\bigcup_{P(u, H_E)} h(d_v(\overline{H_E})) = \partial_u(H_E)$$

Proof. The proof is by induction on the length of u. By Proposition 0 it is true 
for symbols. Assume now that the proposition is true for words u of length less 
than some integer n, n > 1, and prove it for words ua of length n. Denote by F 
the c-derivative d^iHE). By Proposition 0 we have: 

y h{duaA~H^))= U h'{da,{W))) 

P(ua,HB) P{u,Hb), P(ajMF)) 

By Proposition 0 we get: 

U h'{da,iHF))) = da{ U M^)) 

P{u,He), P(aiMF)) P{u,He) 

□ 



5 Finite Automaton Constructions 

This section highlights the connections between three NFAs recognizing the
language of a regular expression $E$: the position automaton $\mathcal{P}_E$, the equation
automaton $\mathcal{E}_E$ and the c-continuation automaton $\mathcal{C}_E$. The construction of these
automata is illustrated by Fig. 1.

5.1 The Position Automaton 

Let $E$, $\mathrm{Pos}(E)$, $\overline{E}$ and $h$ be defined as usual. $L(E)$ is recognized by the automaton
$\mathcal{P}_E$ [11,12], deduced from the position sets First, Last and Follow. $\mathrm{First}(E)$
(resp. $\mathrm{Last}(E)$) is the set of positions that match the first (resp. the last) symbol
of some word in $L(\overline{E})$. $\mathrm{Follow}(E, x)$, for all $x$ in $\mathrm{Pos}(E)$, is the set of positions
that follow the position $x$ in some word of $L(\overline{E})$.

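Definition 2 (position automaton). The position automaton of $E$, $\mathcal{P}_E =
(Q, \Sigma, i, T, \delta)$, is defined by: $Q = \mathrm{Pos}(E) \cup \{0\}$, $i = \{0\}$, $T = [$if $\lambda(E) = 0$ then
$\mathrm{Last}(E)$ else $\mathrm{Last}(E) \cup \{0\}]$, $\delta(0, a) = \{x \in \mathrm{First}(E) \mid h(x) = a\}$, $\forall a \in \Sigma$, and
$\delta(x, a) = \{y \mid y \in \mathrm{Follow}(E, x) \text{ and } h(y) = a\}$, $\forall x \in \mathrm{Pos}(E)$ and $\forall a \in \Sigma$.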
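Definition 2 can be illustrated by the following Python sketch, which computes First, Last and Follow by a single traversal of the linearized expression and then assembles the transitions. The tuple encoding, the function names and the assumption that the expression contains no occurrence of 0 are choices made for this illustration; this is the classical direct construction, not the c-continuation algorithm developed in this paper.

```python
# Hypothetical encoding: ("1",), ("sym", a), ("+", F, G), (".", F, G), ("*", F); no ("0",).

def linearize(e, counter=None):
    """Replace each symbol occurrence by ("pos", index, symbol), indices from 1."""
    if counter is None:
        counter = [0]
    tag = e[0]
    if tag == "sym":
        counter[0] += 1
        return ("pos", counter[0], e[1])
    if tag == "1":
        return e
    if tag == "*":
        return ("*", linearize(e[1], counter))
    return (tag, linearize(e[1], counter), linearize(e[2], counter))

def analyse(e, follow, h):
    """Return (nullable, First, Last) of a linearized expression; fill follow and h."""
    tag = e[0]
    if tag == "1":
        return True, set(), set()
    if tag == "pos":
        follow.setdefault(e[1], set())
        h[e[1]] = e[2]
        return False, {e[1]}, {e[1]}
    if tag == "*":
        _, first, last = analyse(e[1], follow, h)
        for x in last:
            follow[x] |= first
        return True, first, last
    n1, f1, l1 = analyse(e[1], follow, h)
    n2, f2, l2 = analyse(e[2], follow, h)
    if tag == "+":
        return n1 or n2, f1 | f2, l1 | l2
    for x in l1:                       # product: Last(F) is followed by First(G)
        follow[x] |= f2
    return n1 and n2, (f1 | f2 if n1 else f1), (l1 | l2 if n2 else l2)

def position_automaton(e):
    follow, h = {}, {}
    null, first, last = analyse(linearize(e), follow, h)
    states = {0} | set(h)
    finals = set(last) | ({0} if null else set())
    delta = {}
    for x in states:
        for y in (first if x == 0 else follow[x]):
            delta.setdefault((x, h[y]), set()).add(y)
    return states, 0, finals, delta

def accepts(automaton, word):
    _, init, finals, delta = automaton
    current = {init}
    for a in word:
        current = set().union(*(delta.get((q, a), set()) for q in current))
    return bool(current & finals)

# E = (ab + b)*, the running example of Section 4
E = ("*", ("+", (".", ("sym", "a"), ("sym", "b")), ("sym", "b")))
A = position_automaton(E)
print(accepts(A, "abb"), accepts(A, "aab"))   # True False
```

For $E = (ab + b)^*$ the sketch yields the four states $0, a_1, b_2, b_3$, in line with $|Q| = |\mathrm{Pos}(E)| + 1$.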
5.2 The Equation Automaton

Antimirov proves in [2] that the cardinality of the set $PD(E)$ of all partial
derivatives of a regular expression $E$ is less than or equal to $\|E\| + 1$. Hence
the definition of $\mathcal{E}_E$, the equation automaton of $E$, whose states are the partial
derivatives of $E$, and which recognizes $L(E)$.

Definition 3 (equation automaton). The equation automaton of a regular
expression $E$, $\mathcal{E}_E = (Q, \Sigma, i, T, \delta)$, is defined by: $Q = PD(E)$, $i = E$, $T = \{p \mid
\lambda(p) = 1\}$ and $\delta(p, a) = \partial_a(p)$, $\forall p \in Q$ and $\forall a \in \Sigma$.
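As a small worked illustration of Definition 3 (computed here by hand, assuming the usual simplification $1 \cdot G = G$), take the expression $E = ab + a^*$ of Section 4, which has three symbol occurrences:

$$PD(E) = \{\, ab + a^*,\ b,\ a^*,\ 1 \,\}, \qquad |PD(E)| = 4 = \|E\| + 1$$

$$\delta(ab + a^*, a) = \{b,\ a^*\}, \qquad \delta(b, b) = \{1\}, \qquad \delta(a^*, a) = \{a^*\}$$

$$T = \{\, ab + a^*,\ a^*,\ 1 \,\}$$

All other transitions are empty, so on this example the bound on the number of partial derivatives is reached.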

5.3 The c-Continuation Automaton 

Let $E$, $\overline{E}$ and $h$ be defined as usual. We assume 0 is a symbol not in $\mathrm{Pos}(E)$. Let
$c_0 = d_\varepsilon(\overline{E}) = \overline{E}$ and $c_x$ be an abbreviation of $c_x(\overline{E})$. According to Theorem 1,
the set of the c-continuations of the positions in $E$ is finite. Thus we consider
the non-deterministic automaton $\mathcal{C}_E$, called the c-continuation automaton of $E$,
whose states are pairs $(x, c_x)$ with $x$ in $\mathrm{Pos}(E) \cup \{0\}$, and whose transitions are
deduced from the computation of the c-continuations $c_x$.

Definition 4 (c-continuation automaton). The c-continuation automaton
of $E$, $\mathcal{C}_E = (Q, \Sigma, i, T, \delta)$, is defined by: $Q = \{(x, c_x) \mid x \in \mathrm{Pos}(E) \cup \{0\}\}$, $i =
(0, c_0)$, $T = \{(x, c_x) \mid \lambda(c_x) = 1\}$, $\delta((x, c_x), a) = \{(y, c_y) \mid h(y) = a \text{ and } d_y(c_x) =
c_y\}$, $\forall x \in \mathrm{Pos}(E) \cup \{0\}$ and $\forall a \in \Sigma$.

5.4 From $\mathcal{C}_E$ to $\mathcal{P}_E$ and $\mathcal{E}_E$

We now show that $\mathcal{C}_E$ and $\mathcal{P}_E$ are identical, as far as states of $\mathcal{C}_E$ are viewed as
positions. More formally, let $p : Q \to \mathrm{Pos}(E) \cup \{0\}$ be such that $p(x, c_x) = x$. The
automaton $p(\mathcal{C}_E)$ is obtained from $\mathcal{C}_E$ by replacing $Q$ by $\mathrm{Pos}(E) \cup \{0\}$.

Theorem 3. The automata $p(\mathcal{C}_E)$ and $\mathcal{P}_E$ of a regular expression $E$ are identical.

As a corollary, we deduce that $\mathcal{C}_E$ recognizes $L(E)$. The proof follows from
the following proposition:

Proposition 5. Let $E$ be a regular expression. The following equalities hold:

(1) $\mathrm{First}(E) = \{y \in \mathrm{Pos}(E) \mid d_y(\overline{E}) \neq 0\}$;

(2) $\mathrm{Last}(E) = \{y \in \mathrm{Pos}(E) \mid \lambda(c_y(\overline{E})) = 1\}$;

(3) $\mathrm{Follow}(E, x) = \{y \in \mathrm{Pos}(E) \mid d_y(c_x(\overline{E})) \neq 0\}$.

We now explain how to deduce $\mathcal{E}_E$ from $\mathcal{C}_E$. Let $\sim$ be the equivalence relation
on the set of states of $\mathcal{C}_E$, defined by: $(x, c_x) \sim (y, c_y) \Leftrightarrow h(c_x) = h(c_y)$.
Notice that two states $(x, c_x)$ and $(y, c_y)$ such that $c_x \neq c_y$ can be equivalent, as
illustrated by the expression $E = de + fe$, which is such that $c_{d_1} = e_2 \neq e_4 = c_{f_3}$
and $h(c_{d_1}) = e = h(c_{f_3})$.
Let $[c_x]$ denote the class of the state $(x, c_x)$. From Theorem 2 we deduce
that the relation $\sim$ is right-invariant, i.e. for all $a$ in $\Sigma$, for all $(x, c_x)$, $(t, c_t)$ in $Q$
such that $(x, c_x) \sim (t, c_t)$ we have: $(\delta((x, c_x), a))/_\sim = (\delta((t, c_t), a))/_\sim$. Hence
the definition of the quotient automaton $\mathcal{C}_E/_\sim$.

Definition 5 (quotient automaton). The automaton $\mathcal{C}_E/_\sim = (Q_\sim, \Sigma, i, T, \delta)$
is defined as follows: $Q_\sim = \{[c_x] \mid x \in \mathrm{Pos}(E) \cup \{0\}\}$, $i = [c_0]$, $T = \{[c_x] \mid
\lambda(c_x) = 1\}$, and $[c_y] \in \delta([c_x], a) \Leftrightarrow \exists c_z \in [c_y]$, $h(z) = a$ and $d_z(c_x) = c_z$,
$\forall [c_x], [c_y] \in Q_\sim$ and $\forall a \in \Sigma$.

The proof of the following theorem is based on Theorem |3 

Theorem 4. Let $E$ be a regular expression. The quotient automaton $\mathcal{C}_E/_\sim$ is
isomorphic to $\mathcal{E}_E$, the equation automaton of $E$.

Theorems 3 and 4 clarify the relations between the automata $\mathcal{P}_E$, $\mathcal{E}_E$, $\mathcal{C}_E$ and
$\mathcal{C}_E/_\sim$.





Fig. 1. $\mathcal{C}_E = \mathcal{P}_E$ and $\mathcal{C}_E/_\sim = \mathcal{E}_E$ for $E = ((x^*y)^* + x(x^*y)^*y)^*$.
Furthermore, they originate new algorithms for the construction of the posi- 
tion automaton and the equation automaton. These algorithms are described in 

0 

6 Conclusion 

The notion of canonical derivative of a regular expression makes it possible to connect the
notions of continuation in a linear expression and of partial derivative of a regular
expression; it leads to an efficient computation of sets of continuations and sets of
partial derivatives. Thus it is a suitable tool for designing algorithms to convert
a regular expression into a finite automaton: it can be used to compute both the
position automaton and the equation automaton of a regular expression with an
$O(s^2)$ space and time complexity. We are looking for relations over the set of
c-continuations which would lead to smaller automata, with the same quadratic
complexity.
ANNEX A: Example

Let $E = ((x^*y)^* + x(x^*y)^*y)^*$. The automaton $\mathcal{E}_E$ of Fig. 1 is computed by the
program ProgCtoE as follows.

E=((x*.y)*+x.(x*.y)*.y)*

size=16 alphabetic-width=6 #stars=5 alphabet=xy01+.*

Star subexpressions preprocessing (alphabet "xy01+.*")

Before identification : 5 star(s)
s1=x1*
s2=(x1*.y2)*
s3=x4*
s4=(x4*.y5)*
s5=((x1*.y2)*+x3.(x4*.y5)*.y6)*

After identification : 3 star(s)
S1={5} : ((x*.y)*+x.(x*.y)*.y)*
S2={2,4} : (x*.y)*
S3={1,3} : x*

Computation of the pseudo-continuations (alphabet "xy01+.*SSS")

Before identification : 7 pseudo-continuation(s)
l0=S1
l1=S3.y2.S2.S1
l2=S2.S1
l3=S2.y6.S1
l4=S3.y5.S2.y6.S1
l5=S2.y6.S1
l6=S1

After identification : 5 image(s) by mapping h
L0={0,6} : S1
L1={3,5} : S2.y.S1
L2={2} : S2.S1
L3={4} : S3.y.S2.y.S1
L4={1} : S3.y.S2.S1



The equation automaton

Initial state: 0
Final state(s): 0 2
Transitions:
State 0   x   1 4
State 1   x   3
State 2   x   1 4
State 3   x   3
State 4   x   4
Abstract. Several compression methods for finite-state automata are
presented and evaluated. Most compression methods used here are already
described in the literature. However, their impact on the size of
automata has not been described yet. We fill that gap, presenting results
of experiments carried out on automata representing German and Dutch
morphological dictionaries.



1 Introduction 

Finite-state automata are used in various applications. One of the reasons for 
this is that they provide very compact representations of sets of strings. However, 
the size of an automaton measured in bytes can vary considerably depending on 
the storage method in use. Most of them are described in j^, a primary refer- 
ence for all interested in automata compression. However, 0 does not provide 
sufficient data on the influence of particular methods on the size of the result- 
ing automaton. We investigate that in this paper. We used only deterministic, 
acyclic automata in our experiments. However, the methods we used do not de- 
pend on that feature. The automata we used were minimal (otherwise the first 
step in compression should be the minimization). 

Our starting point is as follows. An automaton is stored as a sequence of
transitions (fig. 1). The states are represented only implicitly. A transition has a
label, a pointer to the target state, the number of transitions leaving the target
state (transition counter), and a final marker. We use the transition counter
to determine the boundaries of states, instead of finding them by subtracting 
addresses of states in a large vector of addresses of states as in 0, because we 
can get rid of the vector. 

We store the final marker as the most significant bit of the transition counter.
We use non-standard automata, automata with final transitions (see [2]), because
they have fewer states and fewer transitions than the traditional ones. But since the
storage methods are the same, the results are also valid for traditional automata.

2 Compression Techniques 

Compression techniques fall into three main categories: 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 105- ITT^ 2001. 

© Springer- Verlag Berlin Heidelberg 2001 
Fig. 1. Starting point storage method. i is a label, F marks final transitions, ^ is the
number of outgoing transitions in the target state, and → is a pointer to the target
state. The automaton recognizes the words say and stay.



— coding of input data, 

— making some parts of an automaton share the same space, 

— reducing the size of some elements of an automaton. 

Some techniques may contribute to the compression in two ways, e.g. chang- 
ing the order of transitions in a state can both make sharing some transitions 
possible and reduce the size of some pointers. The first category depends on the 
kind of data that is stored in the automaton. The techniques we use apply to 
natural language dictionaries. 

We define a deterministic finite-state automaton as $A = (\Sigma, Q, i, F, E)$, where
$Q$ is a finite set of states, $i \in Q$ is the initial state, $F \subseteq Q$ is a set of final states,
and $E \subseteq Q \times \Sigma \times Q$ is a set of transitions. We also define a function bytes that
returns the number of bytes needed to store its argument: $\mathrm{bytes}(x) = \lceil \log_{256} x \rceil$.
Total savings (in percent) achieved by using a particular method $M$ on the
starting point automaton are $\eta^M(A) = 100\% \cdot \tau^M(A) \cdot \pi^M(A) / \mathrm{sizeof}(A)$, where
$\tau^M(A)$ is the number of transitions affected by the compression method, $\pi^M(A)$
is the saving in bytes per affected transition, and $\mathrm{sizeof}(A)$ is the size of the
automaton in bytes. The size of an automaton in the starting point representation
is $|E| \, (2 \cdot \mathrm{bytes}(|\Sigma|) + \mathrm{bytes}(|E|))$. In all those calculations, we assume that
the additional one-bit flags they require fit into the space taken by a pointer without
the need to enlarge it.
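The formulas above can be tried out with a short Python sketch. The function names are hypothetical, the size formula is read as one label and one counter of bytes(|Σ|) bytes plus one pointer of bytes(|E|) bytes per transition, and the alphabet size of 256 is an assumption made only for this illustration.

```python
def bytes_needed(x):
    """bytes(x) = ceil(log_256 x): the smallest n with 256**n >= x (for x >= 1)."""
    n = 0
    while 256 ** n < x:
        n += 1
    return n

def start_size(num_transitions, alphabet_size):
    """Size in bytes of the starting point representation:
    per transition, a label, a counter and a pointer."""
    per_transition = 2 * bytes_needed(alphabet_size) + bytes_needed(num_transitions)
    return num_transitions * per_transition

def savings_percent(affected, bytes_saved, size):
    """eta = 100% * tau * pi / sizeof."""
    return 100.0 * affected * bytes_saved / size

# 436,650 transitions is the figure given for the German automaton with coded
# suffixes; the alphabet size 256 is an assumption.
num_transitions, alphabet_size = 436_650, 256
size = start_size(num_transitions, alphabet_size)

# Removing the counter from every transition saves bytes(|Sigma|) bytes each.
print(savings_percent(num_transitions, bytes_needed(alphabet_size), size))  # 20.0
```

Under these assumptions a transition takes 5 bytes, so dropping the 1-byte counter from every transition gives the 20% figure that the conclusions refer to as predicted.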



2.1 Coding of Input Data 

This section applies to natural language morphological dictionaries. Entries in 
such dictionaries usually contain 3 pieces of information: the inflected form, the 
base form, and the categories associated with the inflected form. It is common 
(e.g. in INTEX jH], PO], and in systems developed at the University of Campinas 
0) to represent that information as one string, with the base form coded. The 
standard coding consists of one character that says how many characters should 
be deleted from the end of inflected form so that the rest could match the 
beginning of the base form, and the string of characters that should be appended 
to the result of the previous operation to form the base form. 
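The standard coding described above can be sketched as follows. The concrete choice of the code character (the letter 'A' plus the number of characters to delete) and the function names are assumptions made for this illustration; the text only says that one character carries that number.

```python
def encode_base(inflected, base):
    """Return the code of the base form relative to the inflected form:
    one character for the number of characters to delete from the end of
    the inflected form, followed by the string to append."""
    common = 0
    while (common < min(len(inflected), len(base))
           and inflected[common] == base[common]):
        common += 1
    to_delete = len(inflected) - common
    return chr(ord("A") + to_delete) + base[common:]   # 'A' = delete nothing (assumed)

def decode_base(inflected, code):
    """Rebuild the base form from the inflected form and its code."""
    to_delete = ord(code[0]) - ord("A")
    return inflected[:len(inflected) - to_delete] + code[1:]

code = encode_base("stayed", "stay")
print(code, decode_base("stayed", code))   # C stay
```

Because most inflected forms of a lexeme share the same short code, the strings stored in the automaton share long common suffixes, which is what makes this coding attractive before minimization.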
Such a solution works very well for languages that do not use prefixes or infixes
in their flectional morphology, e.g. French. However, in languages like German
and Dutch, prefixes and infixes are present in many flectional forms. To
accommodate this feature, we need 2 additional codes. The first one gives the
position of the prefix or infix counted from the beginning, the second one the
length of the prefix or infix. For languages that do not use infixes, but do use
prefixes, it is possible to omit the position code.

2.2 Eliminating Transition Counters 

There are two basic ways to eliminate transition counters. One uses a very clever 
sparse matrix representation (see and 0)- Apart from eliminating tran- 

sition counters, it also gives shorter recognition times, as transitions no longer 
have to be checked one by one - they are accessed directly. However, that method 
excludes the use of other compression methods, so we will not discuss it here. 

The other method (giving the same compression) is to see a state not as a set 
of transitions, but as a list of transitions (El)- We no longer have to specify the 
transition count provided that each transition has a 1-bit flag indicating that it
is the last transition belonging to a particular state (see fig. 2). That bit can be
stored in the same space as the pointer to the target state, along with the final
marker. We can combine that method with others.

$\tau^{M}(A) = |E|, \quad \pi^{M}(A) = \mathrm{bytes}(|\Sigma|)$

Fig. 2. States seen as lists of transitions. S is the marker for the last transition in the
state.



2.3 Transition Sharing 

If we look at the figure D we can see that we have exactly the same transition 
twice in the automaton. However, once it is part of a state with 2 different 
transitions, and another time it is part of a state that has only 1 transition. As 
the information about state boundaries is not stored in the transitions belonging 
to the given state, we can share transitions between states (on the left in fig. 3).
More precisely, a smaller state (with a smaller number of outgoing transitions)
can be stored in a bigger one. It is also possible to place transitions of a state
so that part of them falls into one different state, and the rest into another one.
This is possible only when we keep the transition counters, so we will not discuss
that further.

Fig. 3. Two transitions (number 3 and 4 from figures 1 and 2) occupy the same space.
Version with counters on the left, with lists on the right.



In the version that uses lists of transitions, exactly one of the transitions
belonging to a state holds information about one state boundary. The other
boundary is defined by the pointer in transitions that lead to the state. If all
transitions of a smaller state A are present as the last transitions of a bigger
state B, then we can still store A inside B (on the right in fig. 3).
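The list representation and the sharing of a smaller state inside a bigger one can be sketched in Python as follows. The record layout, the field names and the exact ordering of transitions are assumptions made for this illustration (the figures themselves are not reproduced here); the automaton is the say/stay example, with final markers on transitions.

```python
from collections import namedtuple

# target is the index of the first transition of the target state
# (None when the target state has no outgoing transitions);
# last is the stop bit marking the last transition of a state.
Transition = namedtuple("Transition", "label target final last")

def accepts(transitions, word, start=0):
    """Recognize a word in an automaton with final transitions."""
    state = start
    for k, symbol in enumerate(word):
        if state is None:
            return False
        i = state
        while transitions[i].label != symbol:
            if transitions[i].last:        # end of the state's list reached
                return False
            i += 1
        if k == len(word) - 1:
            return transitions[i].final
        state = transitions[i].target
    return False                           # the empty word is not recognized here

# Without sharing: 5 transitions, the transition (a -> y-state) is stored twice.
plain = [
    Transition("s", 1, False, True),       # start state
    Transition("a", 4, False, False),      # state after "s": a, t
    Transition("t", 3, False, True),
    Transition("a", 4, False, True),       # state after "st": a
    Transition("y", None, True, True),     # state after "sa" / "sta": y
]

# With sharing: the state {a} is stored as the last transition of the state {t, a}.
shared = [
    Transition("s", 1, False, True),
    Transition("t", 2, False, False),      # state after "s" starts here ...
    Transition("a", 3, False, True),       # ... and the state after "st" starts here
    Transition("y", None, True, True),
]

print(all(accepts(shared, w) for w in ("say", "stay")), len(plain) - len(shared))  # True 1
```

The only change needed for sharing is the order of transitions in the bigger state: the shared transitions must come last, so that the stop bit closes both states at the same place.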
Fig. 4. The next flag with sharing of transitions. Version with counters on the left,
with lists on the right. N represents the next flag.



2.4 Next Pointers 

Tomasz Kowaltowski et al. (jS|) note that most states have only one incoming 
and one outgoing transition, forming chains of states. It is natural to place such 
states one after another in the data structure. We call a state placed directly 
after the current one the next state. It has been observed in |0| that if we add a 
flag that is on when the target state is the next one, and off otherwise, then we 
do not need the pointer for transitions pointing to the next states. In case of the 
target being the next state, we still need a place for the flags and markers, but 
they take much less space (not more than one byte) than a full pointer. In case 
of the target state not being the next state, we use the full pointer. We need 
one additional bit in the pointer for the flag. Usually, we can find that space. In 
our implementation, we used the next flag only on the last transition of a state. 
Therefore, the representation of our example automaton looks like that given in
figure 4.

The maximum number of transitions that can use next pointers is equal to
the number of states in the automaton minus one (the initial state). The
reason for this is that only one transition leading to a given state may be placed
immediately in front of it in the automaton.

The transitions in states having more than one outgoing transition can be 
arranged in such a way that a transition leading to the next state in the au- 
tomaton may not be the last one. However, if for a given transition its source 
state has exactly one outgoing transition, and its target state has exactly one 
incoming transition, the transition must use the next pointer. 

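Assuming $p, q, r, s \in Q$, and $a, b \in \Sigma$, we have:

$$|\{(p, a, q) \in E : (\forall (p, b, r) \in E:\ b = a,\ r = q) \wedge (\forall (s, b, q) \in E:\ b = a,\ s = p)\}| \le \tau^{np}(A)$$

$$\tau^{np}(A) < |Q|, \quad \pi^{np}(A) = \mathrm{bytes}(|E|) - 1$$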
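The two bounds on the number of transitions that can use the next flag can be computed directly from the transition set. The sketch below assumes a traditional automaton given as a list of (source, label, target) triples; the function name and the state names are hypothetical.

```python
from collections import Counter

def next_pointer_bounds(states, transitions):
    """Return (lower, upper) bounds on the number of transitions using next pointers.

    lower: transitions whose source has exactly one outgoing transition and
           whose target has exactly one incoming transition;
    upper: number of states minus one (the initial state).
    """
    out_degree = Counter(p for p, _, _ in transitions)
    in_degree = Counter(q for _, _, q in transitions)
    lower = sum(1 for p, _, q in transitions
                if out_degree[p] == 1 and in_degree[q] == 1)
    return lower, len(states) - 1

# Traditional automaton for the words say and stay.
states = {"q0", "q1", "q2", "q3", "q4"}
transitions = [
    ("q0", "s", "q1"),
    ("q1", "a", "q3"),
    ("q1", "t", "q2"),
    ("q2", "a", "q3"),
    ("q3", "y", "q4"),
]
print(next_pointer_bounds(states, transitions))   # (2, 4)
```

Here only the transitions labelled s and y are forced to use the next pointer; the actual number achieved between the two bounds depends on how states and transitions are ordered.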



2.5 Tails of States 

In section 2.3 we assumed that only entire states can share the same space as
some larger states. When using the list representation (section 2.2), we can share
only parts of states. We can have two or more states that share some but not
all of their last transitions.

Let us consider a more complicated example (fig. 5). The transition number
4 holds in fact 3 identical transitions. The states reachable from the start state
both have 2 transitions. One of those transitions is common to both states (the
one with label a). The second one is different (labeled with either l or t). To
avoid confusion, we did not use the next flag.

Fig. 5. Automaton recognizing the words pay, play, say, and stay, and sharing last
transitions of states. ^ is a pointer to the tail of the state.

To implement tail sharing we need two things: a new flag (we call it the tail
flag, not shown in figure 5 as its value is implied), and an additional pointer
occurring only when the tail flag is set. When the flag is set, only the first
transitions are kept, and the additional pointer points to the first transition of
the tail shared with some other state. We need 1 bit for the flag, and we allocate
a place for it in the bytes of transition pointers.
2.6 Changing the Order of Transitions

In the examples we have shown so far, nothing was said about the order of transitions
in a state. Techniques of transition sharing depend on the order of transitions.
Automata construction algorithms (e.g. [2]) may impose an initial ordering, but it
may not be the best one. Kowaltowski et al. (jOj) propose sorting the transitions 
on increasing frequency of their labels. They also propose to change the order 
in each state individually. The order of transitions also influences the number 
of states that are to be considered as next. To increase the number of next 
pointers, we try to change the order of transitions in states that do not already 
have that compression. To increase transition sharing, we put every possible set 
of n transitions (starting from the biggest n) of every state into a register, and 
then look for states and tails of states that match them. 



2.7 Other Techniques 

There are other techniques that we have not experimented with. They include 
local pointers and indirect pointers. Local pointers are only mentioned in |H|. 
We can only speculate that what they propose is 1-byte pointers for states that
are located close together in the automaton, and full-length pointers for other states. A
flag is needed to differentiate between the two. Indirect pointers are proposed in
US patent 5,551,026 granted August 27, 1996 to Xerox. By putting pointers to 
frequently referenced locations into a vector of full-length pointers, and replacing 
those pointers in transitions with (short) indexes in the vector, one can gain more 
space. 

3 Experiments 

3.1 Data 

Our experiments were carried out on morphological dictionaries for German 
([3) and Dutch (P). The German morphological dictionary by Sabine Lehmann 
contains 3,977,448 entries. The automaton for the version with coded suffixes had 
307,799 states (of which 62,790 formed chains), and 436,650 transitions. The 
version with coded suffixes, prefixes, and infixes had 107,572 states (of which 
14,213 formed chains), and 176,421 transitions. 



3.2 Results 

Table 1 gives the size of automata built with various options. The sg automata
with SNTO, SNMT, and SNMTO are bigger than expected because there was
no space for one more flag in the pointer.
Table 1. Size of automata built with various options, for Sabine Lehmann's German
morphology and CELEX Dutch morphology, in bytes. In the table, sg means Sabine
Lehmann's German morphology with coded suffixes, cd - CELEX Dutch, i - coded
prefixes and infixes, O - shared transitions, S - stop bit (lists of transitions), N - next
pointers, T - tails of states, and M - changing the order of transitions.

        (none)         O          N         NM         NO        NMO
sg    2,178,268  2,117,628  1,747,210  1,733,730  1,706,512  1,694,405
sgi     882,123    822,133    731,713    718,439    692,514    681,007
cd    3,028,893  2,914,528  2,369,485  2,331,985  2,287,664  2,257,565
cdi   2,873,143  2,758,473  2,255,753  2,221,047  2,183,489  2,147,982

             S         SO        SMO         SN        SNM        SNO
sg    1,742,616  1,720,460  1,703,376  1,311,558  1,298,078  1,300,592
sgi     705,700    682,932    668,084    555,290    542,016    544,696
cd    2,423,116  2,382,936  2,350,584  1,763,708  1,726,208  1,737,200
cdi   2,298,516  2,257,728  2,225,540  1,681,126  1,646,420  1,663,904

          SNMO        STO       STMO       SNTO       SNMT      SNMTO
sg    1,282,648  1,703,292  1,690,229  1,507,949  1,511,461  1,495,909
sgi     528,814    666,202    654,613    531,578    542,016    523,299
cd    1,697,742  2,939,005  2,904,985  1,985,335  1,983,531  1,959,747
cdi   1,619,334  2,779,723  2,745,299  1,902,887  1,894,999  1,876,403

3.3 Conclusions 

In case of German morphology, we managed to compress the initial automaton 
more than fourfold. With coded infixes and prefixes, we compressed the input 
data more than 696 times. Gzip compressed the input data (with coded infixes 
and prefixes) to 16,342,908 bytes. All automata for given input data could be 
made smaller by using compression by over 40% (43.9% for Dutch). The smallest 
automaton we obtained could still be compressed with gzip by 27.77%. The best 
compression method for German turned out to be a good preparation of the in- 
put data. It gave savings from 57.66% to 65.02%. For Dutch, those savings were 
only 4.15-5.50%, as words with prefixes and infixes constituted 3.68% of data, 
and not 22.93% as in case of German. As predicted, elimination of transition 
counters gave 20% on average. The figure was higher (up to 25.13% for German, 
25.98% for Dutch) when next pointers were also used, as counters took pro- 
portionally more space in transitions. The figure was lower (16.93%) when only 
transition sharing was in use, because distributing a state over two other states 
was no longer possible. For German, next pointers gave savings from 15.77% to 
24.70%, i.e. within the predicted range (3.22% -28.26%). For Dutch: 20.84% - 
34.51%. The savings were bigger when the stop bit option was used. Surprisingly, 
transition sharing is less effective (0.84% - 3.01% on sg, and 1.91% - 7.24% on 
sgi, 0.98% - 4.45% on Dutch), and works better on sgi because sg contains many 
chains of states. Compression of tails of states adds only 0.71% to 2.45% for
German, and does not work for Dutch data because the additional bit for a flag
crosses the byte boundary. Changing the order of transitions gives small results
(up to 2.92%). We managed to speed it up so that it takes a fraction of the
construction time.
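Some of the ratios quoted in these conclusions can be rechecked from Table 1 with a few lines of Python. The sketch below only reuses numbers printed in the table (the uncompressed size and the smallest size in each row); the dictionary and function names are hypothetical.

```python
# Uncompressed size and smallest size per row of Table 1, in bytes.
uncompressed = {"sg": 2_178_268, "sgi": 882_123, "cd": 3_028_893, "cdi": 2_873_143}
smallest = {"sg": 1_282_648, "sgi": 523_299, "cd": 1_697_742, "cdi": 1_619_334}

def reduction_percent(before, after):
    """Relative size reduction, in percent."""
    return 100.0 * (before - after) / before

for name in uncompressed:
    print(name, round(reduction_percent(uncompressed[name], smallest[name]), 1))

# German: initial automaton (sg, no options) against the smallest sgi automaton.
print(round(uncompressed["sg"] / smallest["sgi"], 2))
```

Every row shows a reduction above 40%, the Dutch row cd rounds to the 43.9% quoted above, and the ratio between the initial German automaton and the smallest one with coded prefixes and infixes is slightly above four.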



Acknowledgments. We express our gratitude to Sabine Lehmann for making 
her dictionary available to us. 
Abstract. While a 2-dimensional grid picture grammar may generate
pictures (defined as subsets of the unit square) with arbitrarily small 
details, only a finite number of them can be made visible as raster images 
for any given raster. We present an algorithm based on bottom-up tree 
automata which computes the set of all raster images of the pictures 
generated by a given grid picture grammar. 



1 Introduction 

Picture-generating devices, such as iterated function systems, chain-code pic- 
ture grammars, turtle geometry picture grammars, cellular automata, random 
context picture grammars and collage grammars, specify in general infinite pic- 
ture sequences or languages. See, for example, [PJS92, CD93, MRW82, PL90, 
EvdW99, DK99]. Although they are intended and used for modelling visible 
phenomena of various kinds, due to their infinity additional prerequisites are re- 
quired to make the generated pictures visible. Depending on the resolution of the 
chosen output medium, be it a display or a printer, the generated pictures must 
be transformed into raster images before we can have a look at them. In other 
words, a picture-generating system or grammar of the above kind specifies two 
languages. The first language contains the intended pictures, which are usually 
described as geometric objects consisting of points, lines, squares, or other such 
parts in the Euclidean space of dimension 2, 3, or higher. The second language 

* Partially supported by the EC TMR Network GETGRATS (General Theory of 
Graph Transformation Systems), the ESPRIT Basic Research Working Group APP- 
LIGRAPH (Applications of Graph Transformation), and the Deutsche Forschungs- 
consists of the corresponding raster images for a given raster. The question arises
how the picture language and the raster image language are related to each other. 
Clearly, one may transform a single generated picture into its raster image using 
standard techniques. With such a transformation, the raster image language — 
which is always finite — can be computed from a finite number of appropriately 
chosen pictures of the original language. However, which choice is appropriate? 
Or vice versa, given a raster image and a picture-generating device, is the image 
associated with any of the generated pictures? 

In this paper, we study the relation between picture languages and raster 
image languages for the class of grid picture grammars, which are a normal 
form of grid collage grammars [Dre96] and which coincide with random context 
picture grammars [EvdW99] if one ignores the context conditions and fixes a 
uniform grid size for all productions. Moreover, iterated function systems where 
the functions are similarities mapping the unit square onto some cell of the
$k \times k$-grid ($k \ge 2$) can be seen as special cases of grid picture grammars. As
the main result, we show that the raster image language for each $r \times r$-raster of
the unit square can be constructed from the given grid picture grammar. The
construction is based on bottom-up tree automata which get derivation trees as 
input and compute the required raster information. 

The paper is organised in the following way. In Section 2, 2-dimensional grid
picture grammars and their generated picture languages, called galleries, are
defined. The raster images of these pictures are defined in Section 3, which also
provides our main results. In the conclusion, we discuss various generalizations
of our considerations. In order to comply with the space restrictions, details and
elaborated examples had to be omitted. Interested readers can find them in the
long version [DEKK00].

2 Two-Dimensional Grid Picture Grammars 

Let the unit square be divided by an evenly spaced grid into $k^2$ squares, for some
$k \ge 2$. A production of a (two-dimensional) grid picture grammar consists of a
nonterminal symbol on the left-hand side and the square grid on the right-hand
side, each of the small squares in the grid being either black or white or
labelled with a nonterminal.

A derivation starts with the initial nonterminal placed in the unit square. 
Then productions are applied repeatedly until there is no nonterminal left, fi- 
nally yielding a generated picture. A production is applied by choosing a square 
containing a nonterminal A and a production with left-hand side A. The nonter- 
minal is then removed from the square and the square is subdivided into smaller 
black, white, and labelled squares according to the right-hand side of the cho- 
sen production. The set of all pictures generated in this manner constitutes the 
generated gallery. 

Example 1. For $k = 2$, a production is depicted in Figure 1 on the left. On the
right, one can see a direct derivation replacing the topmost nonterminal square.

Fig. 1. A production of a 2 × 2-grid picture grammar and its application
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7T2,2 




1^2, k 


7Tl,l 


7Ti,2 




'^l,k 



Fig. 2. Division of a 
square into sub- 
pictures 



A picture generated by a grid picture grammar can be written as an expression
in a convenient manner: Let the unit black square be represented by the
symbol B and the corresponding white one by W. By definition, each of the
remaining pictures in the generated language consists, as illustrated in Figure 2, of
$k^2$ subpictures $\pi_{1,1}, \ldots, \pi_{1,k}, \ldots, \pi_{k,1}, \ldots, \pi_{k,k}$, each scaled by the factor $1/k$.
If $t_{i,j}$ is the expression representing $\pi_{i,j}$ (for $i, j \in [k]$, where $[k]$ denotes the
set $\{1, \ldots, k\}$), then $[t_{1,1}, \ldots, t_{1,k}, \ldots, t_{k,1}, \ldots, t_{k,k}]$ represents the picture
itself (for $k = 2$ this is nothing else than a quadtree).

Formally speaking, any expression that represents a picture is a term (or
tree) over a certain signature. In general, a signature is a finite set $\Sigma$
of symbols, each symbol $f \in \Sigma$ being assigned a unique rank
$\mathrm{rank}_\Sigma(f) \in \mathbb{N}$. The set $T_\Sigma$ of terms over $\Sigma$ is
the smallest set such that, for $n \in \mathbb{N}$, $f(t_1, \ldots, t_n) \in T_\Sigma$
for all $f \in \Sigma$ with $\mathrm{rank}_\Sigma(f) = n$ and all
$t_1, \ldots, t_n \in T_\Sigma$. For $n = 0$, we may omit the parentheses in $f()$,
writing just $f$ instead.

Thus the terms that represent pictures in the $k \times k$-grid are the terms in
$T_{\Sigma_k}$, where $\Sigma_k = \{[-], B, W\}$, with
$\mathrm{rank}_{\Sigma_k}([-]) = k^2$ and
$\mathrm{rank}_{\Sigma_k}(B) = \mathrm{rank}_{\Sigma_k}(W) = 0$. As a notational
convention, $[-](t_{1,1}, \ldots, t_{k,k})$ is written as $[t_{1,1}, \ldots, t_{k,k}]$.

It is now possible to define pictures and the relationship between pictures
and terms in $T_{\Sigma_k}$ formally. A picture $\pi$ is simply a subset of the
Euclidean plane, i.e., $\pi \subseteq \mathbb{R}^2$; intuitively, $\pi$ is the set
of black points. Each picture generated by a grid picture grammar is a subset of
the unit square, i.e.,
$\pi \subseteq SQ = \{(x, y) \in \mathbb{R}^2 \mid 0 \le x, y \le 1\}$. The value
$\mathrm{val}(t)$ of a term $t \in T_{\Sigma_k}$ is the picture defined inductively
by $\mathrm{val}(B) = SQ$, $\mathrm{val}(W) = \emptyset$, and if
$\mathrm{val}(t_{i,j}) = \pi_{i,j}$ for $i, j \in [k]$, then
$\mathrm{val}([t_{1,1}, \ldots, t_{k,k}]) = [\![\pi_{1,1}, \ldots, \pi_{k,k}]\!]$, where

$$[\![\pi_{1,1}, \ldots, \pi_{k,k}]\!] = \bigcup_{i,j \in [k]} \mathrm{transform}^k_{i,j}(\pi_{i,j}),$$

and $\mathrm{transform}^k_{i,j}(\pi) = \{(\frac{x + j - 1}{k}, \frac{y + i - 1}{k}) \mid (x, y) \in \pi\}$ for every picture $\pi$.

Hence, $\mathrm{val}([t_{1,1}, \ldots, t_{k,k}])$ is obtained by scaling and
translating each picture $\pi_{i,j} = \mathrm{val}(t_{i,j})$ so that it fits into
the $(i, j)$-th square of the $k \times k$-grid.
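The inductive definition of val can be made concrete with a short Python sketch. It rests on assumptions that are not part of the paper: the leaves are the strings "B" and "W", an inner node is a tuple of $k^2$ subterms listed as $t_{1,1}, \ldots, t_{k,k}$, and the first index counts rows from the bottom while the second counts columns from the left (which is what the remark in Example 2 about the upper right quarter suggests). The function name `black_squares` is hypothetical.

```python
from fractions import Fraction

def black_squares(term, k, x=Fraction(0), y=Fraction(0), side=Fraction(1)):
    """Return val(term) as a list of black squares (x, y, side).

    (x, y) is the lower left corner of the square currently being filled.
    """
    if term == "B":                      # val(B) = the whole current square
        return [(x, y, side)]
    if term == "W":                      # val(W) = empty set
        return []
    assert len(term) == k * k, "inner nodes need exactly k*k subterms"
    sub = side / k                       # each subpicture is scaled by 1/k
    squares = []
    for index, child in enumerate(term):
        i, j = divmod(index, k)          # assumed: i = row from bottom, j = column
        squares += black_squares(child, k, x + j * sub, y + i * sub, sub)
    return squares
```

For example, `black_squares(("B", "W", "W", "B"), 2)` yields the lower left and the upper right quarter of the unit square.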

Example 2. The term in $T_{\Sigma_2}$ shown on the left of Figure 3 represents the
picture on the right. For instance, the upper right quarter of the picture is
described by the last line of the term.
t = [[B,[B,B,W,B],W,[B,W,B,B]],
     W,
     [B,W,[B,W,B,B],[B,W,B,B]],
     [B,W,[B,W,B,B],[B,W,B,B]]]

Fig. 3. A term $t \in T_{\Sigma_2}$ and its value val(t)



Since there is a unique correspondence between terms and pictures, all we 
need in order to define grid picture grammars formally is an appropriate gram- 
matical device for generating terms. Such a device is the regular tree grammar. 

Definition 3. A regular tree grammar is a tuple $G = (N, \Sigma, P, S)$ consisting
of a finite set $N$ of nonterminals considered to be symbols of rank 0, a
signature $\Sigma$ disjoint with $N$, a finite set
$P \subseteq N \times T_{\Sigma \cup N}$ of productions, and an initial nonterminal
$S \in N$.

A term $t \in T_{\Sigma \cup N}$ directly derives a term $t' \in T_{\Sigma \cup N}$,
denoted by $t \to_P t'$, if there is a production $A ::= s$ in $P$ such that $t'$
is obtained from $t$ by replacing an occurrence of $A$ in $t$ with $s$. The
language generated by $G$ is $L(G) = \{t \in T_\Sigma \mid S \to^*_P t\}$, where
$\to^*_P$ denotes the reflexive and transitive closure of $\to_P$; such a language
is called a regular tree language.
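The derivation relation can be explored with a small Python sketch that enumerates the terms of $L(G)$ obtainable with a bounded nesting of production applications. The encoding is an assumption made for illustration only: a grammar is a dictionary from nonterminal names to lists of right-hand sides, terminals of rank 0 are plain strings that are not keys of the dictionary, and a bracket term is a tuple. The name `expand` is hypothetical.

```python
from itertools import product

def expand(rhs, grammar, depth):
    """Terminal terms derivable from `rhs` with at most `depth` nested
    production applications."""
    if isinstance(rhs, tuple):                       # bracket term: expand children
        parts = [expand(child, grammar, depth) for child in rhs]
        return {tuple(combo) for combo in product(*parts)}
    if rhs in grammar:                               # nonterminal
        if depth == 0:
            return set()
        terms = set()
        for alternative in grammar[rhs]:
            terms |= expand(alternative, grammar, depth - 1)
        return terms
    return {rhs}                                     # terminal such as "B" or "W"
```

With the productions S ::= [B, S, S, W] and S ::= B, a depth bound of 2 yields exactly the two terms B and [B, B, B, W].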



Definition 4. A grid picture grammar is a regular tree grammar of the form
$G = (N, \Sigma_k, P, S)$, for some $k \ge 2$. The gallery generated by $G$ is
$\Gamma(G) = \{\mathrm{val}(t) \mid t \in L(G)\}$.

Grid picture grammars are a normal form of 2-dimensional grid collage gram- 
mars, see [Dre96, Corollary 4.5]. The only difference is that grid collage grammars 
generate grid collages, which are sets of individual squares (i.e., are sets of sets 
of points), whereas here we are only interested in the resulting pictures (i.e., sets 
of points), obtained by taking the union of all squares in a given collage. 

Strictly speaking, the formal definition of grid picture grammars is more
general than the informal one discussed in the beginning, as it allows productions
such as $A ::= [B, [W, A_1, A_2, W], A_3, B]$. Intuitively, on the right-hand side
of this production the lower right subsquare is refined one level further than the
other three. Productions of this kind are, however, not essential, due to the
following well-known normal-form result on regular tree grammars (see, e.g., [GS97]).

Fact 5. For every regular tree grammar $G$, a regular tree grammar
$G' = (N, \Sigma, P, S)$ generating the same language can be constructed
effectively such that $P$ contains only productions of the form
$A ::= f(A_1, \ldots, A_n)$, where $f \in \Sigma$, $\mathrm{rank}_\Sigma(f) = n$, and
$A, A_1, \ldots, A_n \in N$.

In the special case of grid picture grammars this means that only productions
of the form $A ::= B$, $A ::= W$, and $A ::= [A_{1,1}, \ldots, A_{k,k}]$ are needed,
where $A$ and $A_{1,1}, \ldots, A_{k,k}$ are nonterminals.
Bottom-up tree automata can be seen as the inverse of regular tree grammars.
Instead of generating a tree from the root to the leaves, they ‘consume’ an input 
tree, starting at the leaves and working upwards. 

Definition 6. A bottom-up tree automaton is a tuple $ta = (\Sigma, Q, R, Q_f)$
consisting of an input signature $\Sigma$, a set $Q$ of states which are considered
to be symbols of rank 0, a set $R$ of rules $f(q_1, \ldots, q_n) \to q$, where
$f \in \Sigma$, $\mathrm{rank}_\Sigma(f) = n$, and $q_1, \ldots, q_n, q \in Q$, and a
subset $Q_f$ of $Q$, the set of final states.

For a term $t \in T_{\Sigma \cup Q}$, $t \to_R t'$ if $t'$ is obtained from $t$ by
replacing a subterm of $t$ which is identical to the left-hand side of a rule in
$R$ with the right-hand side of that rule. The set $\mathrm{Acc}(ta)$ of accepted
terms is given by
$\mathrm{Acc}(ta) = \{t \in T_\Sigma \mid t \to^*_R q \text{ for some } q \in Q_f\}$.
A finite bottom-up tree automaton is a bottom-up tree automaton with only finitely
many states.
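How such an automaton consumes a term can be sketched in Python. The encoding is an assumption for illustration: leaves are the strings "B" and "W", a bracket term is a tuple whose symbol is written "[]", and the rule set is a dictionary from (symbol, tuple of child states) to the set of possible result states. The example automaton for $k = 2$, with state "b" meaning "contains a black square" and "w" meaning "entirely white", is hypothetical and not taken from the paper.

```python
from itertools import product

def reachable_states(term, rules):
    """All states q with term ->*_R q, computed from the leaves upwards."""
    if isinstance(term, tuple):
        symbol, children = "[]", term
    else:
        symbol, children = term, ()
    child_sets = [reachable_states(child, rules) for child in children]
    states = set()
    for combo in product(*child_sets):          # one state per child
        states |= rules.get((symbol, combo), set())
    return states

def accepts(term, rules, final):
    """True if some final state is reachable for the whole term."""
    return bool(reachable_states(term, rules) & final)

# Hypothetical automaton for k = 2 accepting terms whose picture is non-empty.
RULES = {("B", ()): {"b"}, ("W", ()): {"w"}}
for combo in product("bw", repeat=4):
    RULES[("[]", combo)] = {"b" if "b" in combo else "w"}
FINAL = {"b"}
```

A term such as [W, [W, B, W, W], W, W] is accepted, whereas [W, W, W, W] is not.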

Due to the following fact (see, e.g., [GS97]), finite bottom-up tree automata 
are a useful tool if one aims at computability results concerning sets generated 
by regular tree grammars. 

Fact 7. There is an algorithm which takes as input a regular tree grammar G 
and a finite bottom-up tree automaton ta, and decides whether 

$L(G) \cap \mathrm{Acc}(ta) = \emptyset$.
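One standard way to obtain such a decision procedure, sketched here under assumptions and not necessarily the construction given in [GS97], is a fixpoint computation: for a grammar in the normal form of Fact 5, compute for every nonterminal the set of automaton states reachable by some term it derives. The encoding (productions as triples, rules as a dictionary) and all names are hypothetical.

```python
from collections import defaultdict
from itertools import product

def derivable_states(productions, rules):
    """Map each nonterminal A to the states q such that A derives a term t
    with t ->*_R q. `productions` holds triples (A, symbol, (A1, ..., An))."""
    states = defaultdict(set)
    changed = True
    while changed:                                   # iterate until nothing new appears
        changed = False
        for lhs, symbol, args in productions:
            pools = [tuple(states[a]) for a in args]  # snapshot before updating
            for combo in product(*pools):
                new = rules.get((symbol, combo), set()) - states[lhs]
                if new:
                    states[lhs] |= new
                    changed = True
    return states

def intersection_is_empty(productions, start, rules, final):
    """Decide whether L(G) and Acc(ta) have no term in common."""
    return not (derivable_states(productions, rules)[start] & final)

# Hypothetical automaton for k = 2: state "b" = picture contains black.
RULES = {("B", ()): {"b"}, ("W", ()): {"w"}}
for combo in product("bw", repeat=4):
    RULES[("[]", combo)] = {"b" if "b" in combo else "w"}
```

The loop terminates because the state sets only grow and there are finitely many states and nonterminals.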

3 Computing Raster Images 

A raster, such as the one provided by the finite resolution of a computer screen, 
can be seen as an $r \times r'$-grid that is evenly spaced in each direction. In order to
simplify the situation, we shall consider in the following the idealised situation 
where r = r' (and the area to be displayed is the unit square). 

Given a picture $\pi \subseteq SQ$, our aim is to determine the raster image that
corresponds to $\pi$. The latter can be defined as follows: For a given square in
the $r \times r$-grid, if any point of $\pi$ lies properly inside the square, then
that square is filled with the colour black, else it is coloured white. The upper
raster image $\mathrm{raster}^u_r(\pi)$ obtained in this way is the smallest set of
raster squares which cover $\pi$, and thus the least upper bound on $\pi$ with
respect to the considered raster. More formally, let $sq$ be $SQ$ without its
boundary, i.e., $sq = \{(x, y) \in \mathbb{R}^2 \mid 0 < x, y < 1\}$. Then
$\mathrm{raster}^u_r(\pi) = \mathrm{val}([bw^u_{1,1}, \ldots, bw^u_{r,r}])$, where
$bw^u_{i,j} = B$ if $\pi \cap \mathrm{transform}^r_{i,j}(sq) \ne \emptyset$ and
$bw^u_{i,j} = W$ otherwise.

Analogously, one may define the lower raster image of $\pi$ as the largest raster
image covered by $\pi$:
$\mathrm{raster}^l_r(\pi) = \mathrm{val}([bw^l_{1,1}, \ldots, bw^l_{r,r}])$, where
$bw^l_{i,j} = B$ if $\mathrm{transform}^r_{i,j}(sq) \subseteq \pi$ and
$bw^l_{i,j} = W$ otherwise.

For a grid picture grammar $G$, the set of all upper $r \times r$-raster images
obtained from pictures in $\Gamma(G)$ is denoted by
$\mathcal{R}^u_r(G) = \{\mathrm{raster}^u_r(\pi) \mid \pi \in \Gamma(G)\}$.
Similarly, $\mathcal{R}^l_r(G) = \{\mathrm{raster}^l_r(\pi) \mid \pi \in \Gamma(G)\}$
denotes the set of all lower raster images generated by $G$.
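For a single picture given as a finite union of closed black squares with pairwise disjoint interiors (as produced by a term), both raster images can be computed exactly with rational arithmetic. The following Python sketch is an illustration under assumptions, not the algorithm of the paper: squares are triples (x, y, side), raster cells are addressed as (row from bottom, column), and the open cell is tested through the area it shares with the picture. A positive shared area means the open cell meets the picture; a shared area equal to the cell area means the open cell is covered.

```python
from fractions import Fraction

def overlap(cell, square):
    """Area shared by two axis-parallel squares given as (x, y, side)."""
    (cx, cy, cs), (x, y, s) = cell, square
    width = min(cx + cs, x + s) - max(cx, x)
    height = min(cy + cs, y + s) - max(cy, y)
    return max(width, 0) * max(height, 0)

def raster_images(squares, r):
    """Return (upper, lower) raster images as sets of 0-based (row, column)
    cells; `squares` must have pairwise disjoint interiors."""
    side = Fraction(1, r)
    upper, lower = set(), set()
    for i in range(r):
        for j in range(r):
            cell = (j * side, i * side, side)
            area = sum(overlap(cell, square) for square in squares)
            if area > 0:                 # open cell intersects the picture
                upper.add((i, j))
            if area == side * side:      # open cell is contained in the picture
                lower.add((i, j))
    return upper, lower
```

For the black lower left quarter of the unit square and r = 3, the upper image blackens four cells while the lower image blackens only the corner cell.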
Example 8. The picture considered in Example 2 ‘fits’ exactly into an 8 × 8-grid.
Figure 4 illustrates how its upper raster image is computed for a 10 × 10-raster:
whenever the interior of a raster square is not completely white, it is painted
black.

We shall now show that $\mathcal{R}^u_r(G)$ is computable for every grid picture
grammar $G$ and every $r \in \mathbb{N}_+$ (where both $G$ and $r$ are given in the
input). A corresponding result for $\mathcal{R}^l_r(G)$ will then follow without
much ado. For notational simplicity, let us fix an arbitrary grid size $k \ge 2$
to be used throughout the rest of this section. In particular, we denote
$\mathrm{transform}^k_{i,j}$ by $\mathrm{transform}_{i,j}$.

Intuitively, if we consider a derivation in $G$, replacing all nonterminals in
parallel in each step, the side length of the remaining nonterminal squares is
$k^{-n}$ after $n$ steps. After finitely many steps all nonterminal squares have
become smaller than the raster squares, so each of them is divided by at most one
horizontal and at most one vertical raster line. As depicted in Figure 5, this
results in a segmentation of the nonterminal square into at most four rectangular
subsets. If one of the dividing lines lies outside the square (or on one of its
edges), or both of them do, two or even three of the rectangles are empty and are
not considered.
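The number of parallel derivation steps needed before the nonterminal squares are smaller than the raster squares follows directly from the side length $k^{-n}$: it is the least $n$ with $k^n > r$. A tiny Python sketch (the function name is hypothetical):

```python
def steps_until_smaller(k, r):
    """Least n such that k**(-n) < 1/r, i.e. k**n > r."""
    n = 0
    while k ** n <= r:       # side length k^-n is not yet below 1/r
        n += 1
    return n
```

For a 2 × 2-grid and a 10 × 10-raster this gives four steps, since $2^4 = 16 > 10$.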

Suppose that we continue the derivation. Upon termination, some of the four
rectangles may contain points of the derived picture, turning the raster squares to
which they belong black, while the remaining rectangles do not. Thus, situations
such as the one shown in Figure 5 have to be investigated to develop the algorithm
we aim at. In order to become independent of the actual size of the square under
consideration, we magnify it to the size of the unit square $SQ$. Its segmentation
into rectangles is uniquely determined by the positions of the corresponding raster
lines, two real coordinates. As described above, a pair
$(d_1, d_2) \in \mathbb{R}^2$ of such coordinates divides the open square $sq$ into
at most four open rectangles. We can address these segments by the four corners
$(c_1, c_2) \in \{0, 1\}^2$ of $SQ$ by defining
$\mathrm{rec}_{d_1,d_2}(c_1, c_2) = \{(x_1, x_2) \in sq \mid c_l < x_l < d_l \text{ or } d_l < x_l < c_l \text{ for } l \in \{1, 2\}\}$,
as illustrated in Figure 6. Now, let
$\mathrm{DIV}_{d_1,d_2} = \{\mathrm{rec}_{d_1,d_2}(c_1, c_2) \mid c_1, c_2 \in \{0, 1\}\}$.

Fig. 5. Segmentation of a nonterminal square by raster lines

Fig. 6. Rectangular segments determined by $d_1$ and $d_2$

Given $d_1, d_2$ and a picture $\pi$, we are interested in those
$rec \in \mathrm{DIV}_{d_1,d_2}$ which intersect with $\pi$. The corresponding
subset of $\mathrm{DIV}_{d_1,d_2}$ is denoted by $\mathrm{div}_{d_1,d_2}(\pi)$:
$\mathrm{div}_{d_1,d_2}(\pi) = \{rec \in \mathrm{DIV}_{d_1,d_2} \mid rec \cap \pi \ne \emptyset\}$.
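The segments and the function div can be illustrated by a Python sketch. It makes several assumptions for the sake of a runnable example: an open rectangle is stored as a pair of open intervals ((lo1, hi1), (lo2, hi2)), empty rectangles are dropped from DIV, and the picture is given as a list of closed black squares (x, y, side) with the first coordinate horizontal. All function names are hypothetical.

```python
from fractions import Fraction

def segment(c, d):
    """Open interval between corner coordinate c and divider d, clipped to
    (0, 1); None if it is empty."""
    lo, hi = (c, d) if c < d else (d, c)
    lo, hi = max(lo, 0), min(hi, 1)
    return (lo, hi) if lo < hi else None

def rec(d1, d2, c1, c2):
    """The rectangle addressed by corner (c1, c2), or None if it is empty."""
    s1, s2 = segment(c1, d1), segment(c2, d2)
    return None if s1 is None or s2 is None else (s1, s2)

def DIV(d1, d2):
    """All non-empty rectangles determined by d1 and d2."""
    return {rec(d1, d2, c1, c2) for c1 in (0, 1) for c2 in (0, 1)} - {None}

def div(d1, d2, squares):
    """Rectangles of DIV(d1, d2) that intersect the union of the squares."""
    hit = set()
    for (lo1, hi1), (lo2, hi2) in DIV(d1, d2):
        for x, y, s in squares:
            # an open rectangle meets a closed square iff the strict
            # inequalities hold in both coordinates
            if x < hi1 and lo1 < x + s and y < hi2 and lo2 < y + s:
                hit.add(((lo1, hi1), (lo2, hi2)))
    return hit
```

With $d_1 = 1/3$ and $d_2 = 0$ (no second dividing line), only two rectangles remain, and a black square of side 1/4 in the lower left corner meets only the first of them.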

The key to our algorithm is the lemma below. It states that
$\mathrm{div}_{d_1,d_2}(\pi)$ can be determined by a bottom-up tree automaton which
processes the term denoting $\pi$. The automaton is finite if $d_1, d_2$ are
rational numbers, because, roughly speaking, its states are the suffixes of the
representation of $d_1$ and $d_2$ to the base $k$.
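Why rationality gives finiteness can be seen from a small computation. The suffix of the base-$k$ representation of $d \in [0, 1)$ after $n$ digits is the representation of the fractional part of $k^n d$, and for rational $d$ these values eventually repeat. The Python sketch below only illustrates this observation; it does not construct the automaton of the lemma, and the function name is hypothetical.

```python
from fractions import Fraction

def base_k_suffixes(d, k):
    """Distinct values represented by the suffixes of the base-k expansion
    of a rational d in [0, 1), in order of first appearance."""
    seen = []
    while d not in seen:
        seen.append(d)
        d = (d * k) % 1          # drop the leading base-k digit
    return seen
```

For $d = 1/3$ and $k = 2$ (binary 0.010101...) only the two values 1/3 and 2/3 occur.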

Lemma 9. Let $d_1, d_2$ be rational numbers and $D \subseteq \mathrm{DIV}_{d_1,d_2}$.
One can effectively construct a finite bottom-up tree automaton $ta_D$ such that

$$\mathrm{Acc}(ta_D) = \{t \in T_{\Sigma_k} \mid \mathrm{div}_{d_1,d_2}(\mathrm{val}(t)) = D\}.$$

Using the previous lemma, we can now prove the main theorem of this paper. 

Theorem 10. There is an algorithm which takes as input a grid picture grammar
$G$ and a number $r \in \mathbb{N}_+$, and yields as output the set
$\mathcal{R}^u_r(G)$.

Proof sketch. Consider a grid picture grammar $G = (N, \Sigma_k, P, S)$ in normal
form (see Fact 5). By applying productions to the right-hand sides of productions
in all possible ways, $G$ can be turned into a $k^2 \times k^2$-grid picture grammar
which generates the same language. Since this process can be repeated as often as
necessary, it follows that we can assume without loss of generality that $k > r$.

Now, let us consider $\mathcal{R}^u_r(G)$. Since every derivation of a term in
$L(G)$ has the form $S \to_P B$, $S \to_P W$, or
$S \to_P [A_{1,1}, \ldots, A_{k,k}] \to^*_P [t_{1,1}, \ldots, t_{k,k}]$, the
following raster images (and only those) are contained in $\mathcal{R}^u_r(G)$:
(a) $SQ$ if $P$ contains the production $S ::= B$, (b) $\emptyset$ if $P$ contains
the production $S ::= W$, and (c) all images of the form $\mathrm{raster}^u_r(\pi)$,
where $\pi = [\![\mathrm{val}(t_{1,1}), \ldots, \mathrm{val}(t_{k,k})]\!]$ for some
production $S ::= [A_{1,1}, \ldots, A_{k,k}]$ in $P$ and derivations
$A_{i_1,i_2} \to^*_P t_{i_1,i_2}$ ($i_1, i_2 \in [k]$).

The raster images resulting from the first two items are easy to compute.
Therefore, consider a production $S ::= [A_{1,1}, \ldots, A_{k,k}]$, and let
$\Gamma_{i_1,i_2} = \{\mathrm{val}(t) \mid A_{i_1,i_2} \to^*_P t \in T_{\Sigma_k}\}$
denote the gallery $A_{i_1,i_2}$ derives. We have to show how the set
$\{\mathrm{raster}^u_r([\![\pi_{1,1}, \ldots, \pi_{k,k}]\!]) \mid \pi_{i_1,i_2} \in \Gamma_{i_1,i_2} \text{ for } i_1, i_2 \in [k]\}$
can be computed.

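Consider some $i_1, i_2 \in [k]$ and let $(e_1, e_2)$ be the pair of rational
numbers representing the raster lines which cut the square $(i_1, i_2)$ (where
$e_j = 0$ if there is no such line). Now, let
$(d_1, d_2) = \mathrm{transform}^{-1}_{i_1,i_2}(e_1, e_2)$ and define
$\mathcal{D}_{i_1,i_2} = \{\mathrm{div}_{d_1,d_2}(\pi_{i_1,i_2}) \mid \pi_{i_1,i_2} \in \Gamma_{i_1,i_2}\}$.
By Lemma 9 in connection with Fact 7, $\mathcal{D}_{i_1,i_2}$ is computable, since
$\Gamma_{i_1,i_2}$ is generated by the grid picture grammar
$G_{i_1,i_2} = (N, \Sigma_k, P, A_{i_1,i_2})$ and
$\mathcal{D}_{i_1,i_2} = \{D \subseteq \mathrm{DIV}_{d_1,d_2} \mid L(G_{i_1,i_2}) \cap \mathrm{Acc}(ta_D) \ne \emptyset\}$,
where $ta_D$ is as in Lemma 9.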
By the definition of $\mathcal{D}_{i_1,i_2}$ and the choice of $d_1$ and $d_2$, a
raster image img has the form
$\mathrm{raster}^u_r([\![\pi_{1,1}, \ldots, \pi_{k,k}]\!])$ with
$\pi_{i_1,i_2} \in \Gamma_{i_1,i_2}$ ($i_1, i_2 \in [k]$) if and only if
$img = \mathrm{raster}^u_r([\![D_{1,1}, \ldots, D_{k,k}]\!])$ for some
$D_{i_1,i_2} \in \mathcal{D}_{i_1,i_2}$ ($i_1, i_2 \in [k]$). Since the (finite!)
sets $\mathcal{D}_{i_1,i_2}$ are computable, these raster images are computable as
well. □

Theorem 10 easily carries over to lower raster images.

Theorem 11. There is an algorithm which takes as input a grid picture grammar
$G$ and a number $r \in \mathbb{N}_+$, and yields as output the set
$\mathcal{R}^l_r(G)$.

Proof sketch. The picture $\pi = \mathrm{val}(t)$ represented by a term
$t \in T_{\Sigma_k}$ can easily be inverted by turning white squares into black
ones and vice versa. Thus, the grid picture grammar $G' = (N, \Sigma_k, P', S)$
obtained from $G = (N, \Sigma_k, P, S)$ by exchanging B and W in all right-hand
sides of productions generates the gallery of inverted pictures. As the lower
raster image of a picture in $\Gamma(G)$ is the inverted upper raster image of the
inverted picture, $\mathcal{R}^l_r(G)$ can be obtained by computing
$\mathcal{R}^u_r(G')$ and subsequently inverting the resulting raster images. □
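The inversion used in this proof sketch is a simple structural recursion on terms. As a minimal Python illustration, assuming the same hypothetical encoding as before (strings "B" and "W" for leaves, tuples for bracket terms):

```python
def invert(term):
    """Exchange B and W everywhere in a term (or in a right-hand side;
    nonterminal names are left unchanged)."""
    if term == "B":
        return "W"
    if term == "W":
        return "B"
    if isinstance(term, tuple):
        return tuple(invert(child) for child in term)
    return term                  # a nonterminal on a right-hand side
```

Applying `invert` to every right-hand side of $P$ gives $P'$; applying it twice returns the original term.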

As an immediate consequence, we get the following corollary. 

Corollary 12. The following problems are decidable. 

1. Given as input a grid picture grammar $G$, a number $r \in \mathbb{N}_+$, and an
$r \times r$-raster image img: Is $img \in \mathcal{R}^u_r(G)$?

2. Given as input two grid picture grammars $G$, $G'$ (possibly based on different
grids) and a number $r \in \mathbb{N}_+$: Is
$\mathcal{R}^u_r(G) = \mathcal{R}^u_r(G')$?

The analogous statements for $\mathcal{R}^l_r$ instead of $\mathcal{R}^u_r$ hold as
well.

4 Conclusion 

In this paper, we have presented an algorithm that computes the set of
$r \times r$-raster images of the pictures generated by a 2-dimensional grid
picture grammar. It is easily seen that some restrictions are imposed for technical
reasons only and can be relaxed without problems. Since our reasoning works for the
two axes of the plane independently and is closed under affine transformation, the
underlying unit square may be replaced by a rectangle, or even a parallelogram, and
the $k \times k$-grid underlying the grid picture grammar by any $k \times k'$-grid,
for $k, k' \ge 2$. Similarly, the $r \times r$-raster may be generalised to an
$r \times r'$-raster, for $r, r' \ge 1$. In fact, as far as the raster is
concerned, it may be worthwhile to note that only three of its properties are used
in the proofs, namely that the raster lines be parallel with the axes, be
determined by rational numbers, and be finite in number. Thus, one could use a
non-uniform raster provided it does not violate any of these requirements. Clearly,
our considerations also remain true if we deal with grids in the Euclidean space of
dimension $d$ with $d \ge 3$, or if we deal with grammars where distinct
productions need not use the same grid.

In contrast to that, it seems much more difficult to get similar results for other 
types of picture generating devices, such as random context picture grammars, 
collage grammars, or iterated function systems, and for other modes of rewriting 
such as table-driven parallel or context-sensitive rewriting. 

Last but not least, one could investigate more complex rasterisation functions
than $\mathrm{raster}^u_r$ and $\mathrm{raster}^l_r$. An interesting candidate would be the one which
blackens a raster square whenever the corresponding picture occupies at least 
half the area of the square. One could also consider using grey-scale values to 
express the percentage of black in a raster square. To study questions like these 
remains a matter of future work. 
Abstract. We present a noncanonical extension to the Discriminating
Reverse parsing method, which accepts non-LR grammars. In cases of 
parsing conflict, actions are deferred and a mark is virtually pushed onto 
the parsing stack. Then, locally-canonical DR parsing resumes until suf- 
ficient right context is read to resolve the initial conflict. Marks code 
coverings of the right contexts that are compatible with the actions in 
conflict. A suboptimal solution for such a coding is proposed, which is 
computed from the DR automaton itself. The stack vocabulary is en- 
larged with the mark set, but no new state is added to the basic DR 
automaton. Moreover, conflict resolution basically uses the DR parser. 

The method determines at construction time whether all the conflicts 
can be resolved, and only produces deterministic parsers. 

1 Introduction 

When designing a grammar, e.g., for a programming or specification language,
the natural process is to reflect the logical constructions of the language.
Unfortunately, such a naturally designed grammar will give rise to a number of
conflicts when given to most available parser generators. But conflict resolution
by the user is often unsafe, because it usually forces the user to transform the
grammar in unintended ways, which may result in accepting a different language.
Thus, conflict resolution and, in general, forcing the user to adapt the grammar
should be avoided as much as possible.

One of the reasons for such limitations in the classes of grammars that
practical parser generators accept is that they limit the lookahead length to one
because of the combinatorial complexity of using longer lookahead windows,
notably in parsers based on the LR method. Efforts have been made to design
safe parser generators allowing larger classes of grammars by using unbounded
lookahead, but either they allow only relatively restricted parsers to be built PEI,
or they accept ambiguous grammars and produce nondeterministic parsers HP
IT20 |. But nondeterminism is unacceptable in many application areas. In this
paper we present the basis for noncanonical extensions to the discriminating
reverse (DR) method in order to produce deterministic parsers for grammars and
languages not restricted to LRR [2].

The DR approach to bottom-up parsing [TilllT] allows linear deterministic
DR($k$) parsers to be built for the general class of LR($k$) grammars. It largely

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 122- ITlHl 2001. 

© Springer- Verlag Berlin Heidelberg 2001 
avoids the state number explosion of the “direct” Knuth construction |^. Thus,
the resulting parsers are simpler and smaller, and their construction is faster.
DR($k$) parsers are particularly interesting as our base because, contrary to
state-stack parsers, the decision process is local to a (typically small)
symbol-stack suffix, which allows parsing to resume naturally after a conflict.
Moreover, it is quite natural to compute from the DR construction the conflict
right contexts to follow.

The organisation of the paper is as follows. Section 2 reviews the basic DR(0)
construction. Section 3 introduces NDR parsing and Section 4 presents in detail
our proposed solution. The parser generation and parsing algorithms are then
given, and the last section presents the conclusions.

Notation. 

We shall follow in general the usual conventions, as those in m 

We will sometimes write grammar rules as $A \xrightarrow{i} \alpha$, where $i$ is
the rule number, $2 \le i \le |P| + 1$. Grammars $G(V, T, P, S)$ are considered to
be augmented to $G'(V', T', P', S')$ with the addition of the rule
$S' \xrightarrow{1} \vdash S \dashv$, and are supposed to be reduced.

i will be a distinguished symbol. Symbols in V' = V VJ {e} are noted X, and 
strings in V'* are noted a. By convention, e =>£. 



2 Construction of the DR(0) Automaton 

In order to simplify the presentation, we will use the DR(0) construction as the 
basis to our extension. 

The basic idea of the discriminating reverse approach is to decide the next
shift-reduce parsing action by exploring a suffix of the parsing stack, beginning
from the stack top. For this exploration, a DFA is built such that each state is
associated with the set of actions which are compatible with the suffix read thus
far. As soon as the next transition would lead to a single-action state, that
action can be decided without exploring deeper.
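The exploration loop itself is simple once the DFA is available. The following Python sketch only illustrates the idea and is not the DR(0) construction: the transition table, the action sets, the state names, and the action numbers in the toy example are all hypothetical.

```python
def decide_action(stack, transitions, actions, start="q0"):
    """Read the stack from its top downwards and return the parsing action
    as soon as a single-action state is reached; None if undecided."""
    state = start
    for symbol in reversed(stack):              # stack[-1] is the stack top
        state = transitions.get((state, symbol))
        if state is None:
            return None                         # suffix not recognised
        if len(actions[state]) == 1:
            return next(iter(actions[state]))   # decided without going deeper
    return None                                 # stack exhausted, still ambiguous

# Toy tables: after reading "d", actions 4 and 6 are both possible;
# one more symbol below it settles the choice.
TRANSITIONS = {("q0", "d"): "q1", ("q1", "a"): "q2", ("q1", "b"): "q3"}
ACTIONS = {"q0": {0, 4, 6}, "q1": {4, 6}, "q2": {4}, "q3": {6}}
```

Only as many stack symbols are read as are needed to single out one action, which is why the decision is local to a short stack suffix.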

The construction of the DR(0) automaton shown here follows that shown in
P], with the introduction of minor changes useful to the noncanonical extension
presented later in the paper. The construction follows the item-set model
associated with each state of the automaton.



2.1 DR(0) Items 

A DR(0) item $[i, A \to \alpha \bullet \beta]$ has the following meaning:

- The $i$ indicates the parsing action associated with the item (rule number, or 0
  for shift). A dot beside it shows whether the rightpart to reduce has already
  been completely scanned. It will not be written when its position has no
  relevance, or when its position does not change.
- The core part is some rule $A \to \alpha\beta$ with a dot that represents the
  current stack exploration point, i.e., if $\sigma$ is the already scanned stack
  suffix, then $\beta \Rightarrow^*_{rm} \sigma x$.



2.2 Closure of an Item Set 

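The closure of an item set $I_q$ is the minimal set $\Delta_0(I_q)$ such that

$$\Delta_0(I_q) = I_q \cup \{[\bullet i, A \to \alpha \bullet B\beta] \mid A \to \alpha B\beta \in P',\ [i, B \to \bullet\gamma] \in \Delta_0(I_q)\}.$$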

Since stack exploration is reverse, the rightpart dot moves in items from right 
to left, and thus the closure makes the dot ascend in the parsing tree while it is 
at the left end of rightparts. 

The closure only generates items with left-dotted action codes, in order to 
trace that rightparts to reduce have been completely scanned. 
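The closure can be computed with a worklist, as in the Python sketch below. It assumes one reading of the definition above: whenever an item has its rightpart dot at the left end of a rule for $B$, every rule containing $B$ contributes an item with the dot placed just before that occurrence of $B$, the same action, and a left-dotted action code. The item encoding (action, scanned flag, left-hand side, rightpart, dot position), the rule numbering, and the toy grammar are hypothetical.

```python
def closure(items, rules):
    """items: set of (action, scanned, lhs, rhs, dot), where `dot` is the
    number of rightpart symbols to the left of the dot and `scanned` is True
    for a left-dotted action code; rules: list of (lhs, rhs)."""
    result = set(items)
    work = list(items)
    while work:
        action, _, lhs, _, dot = work.pop()
        if dot != 0:
            continue                          # dot not at the left end
        for parent, rhs in rules:
            for pos, symbol in enumerate(rhs):
                if symbol == lhs:             # ascend to an occurrence of lhs
                    new = (action, True, parent, rhs, pos)
                    if new not in result:
                        result.add(new)
                        work.append(new)
    return result

# Toy grammar with end markers written "|-" and "-|".
RULES = [("S'", ("|-", "S", "-|")), ("S", ("A", "C")), ("A", ("d",)), ("C", ("e",))]
```

Starting from the single item for action 5 with core A → •d, the closure ascends to S → •AC and then to S' → ⊢•S⊣, where it stops because the dot is no longer at the left end.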



2.3 Initial State and Transition Function on States 

Since exploration begins at the stack top, all actions are valid at the initial
state $q_0$. We have to consider all possible valid positions of the stack top: the
right end of rules' rightparts for the corresponding reductions when they are about
to be reduced, and all the possible occurrences of terminal symbols within
rightparts when the corresponding action is to shift.

$$I_{q_0} = \Delta_0(\{[i \bullet, A \to \alpha \bullet] \mid A \xrightarrow{i} \alpha \in P'\} \cup \{[\bullet 0, B \to \beta \bullet a\beta'] \mid B \to \beta a \beta' \in P'\}).$$

Transition on the next (lower in the stack) symbol is simply computed by
moving the dot one position to the left¹:

$$\Delta(I_q, X) = \Delta_0(\{[i, A \to \alpha \bullet X\beta] \mid [i, A \to \alpha X \bullet \beta] \in I_q\}).$$

Each item set is associated with a single state, i.e., $I_{q'} = I_q$ implies
$q' = q$. It is useful to extend the transition function to strings in $V'^*$:

$$\Delta(I_q, \varepsilon) = I_q, \qquad \Delta(I_q, X\alpha) = \Delta(\Delta(I_q, \alpha), X).$$

2.4 Transition Function on Nodes 

The underlying directed graph on items within the DR(0) automaton item sets 
represents their relation in actual derivation trees, according to their relationship 
by transition and closure. It will allow us to follow the paths corresponding 
to these trees in order to recover the necessary context to resolve the DR(0) 
conflicts. 

¹ Here the convention applies that, when no action dot is represented, its
position is the same on both sides.
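Thus, a node $\nu = [i, A \to \alpha \bullet \beta]_q$ is associated with an item
$[i, A \to \alpha \bullet \beta] \in I_q$. The same item in different item sets
will be different nodes.

(Single) transitions are defined on $V' \cup \{\epsilon\}$:

$$\delta([i, A \to \alpha \bullet \beta]_q, X) = \begin{cases} \{[\bullet i, B \to \varphi \bullet A\psi]_q\} & \text{if } \alpha = \varepsilon \text{ and } X = \epsilon \\ \{[i, A \to \alpha' \bullet X\beta]_{q'} \mid I_{q'} = \Delta(I_q, X)\} & \text{if } \alpha = \alpha' X \\ \emptyset & \text{otherwise,} \end{cases}$$

where the special symbol $\epsilon$ is used for closure-related transitions. We
extend this function to strings in $(V' \cup \{\epsilon\})^*$:

$$\delta(\nu, \varepsilon) = \{\nu\}, \qquad \delta(\nu, \alpha X) = \bigcup_{\nu' \in \delta(\nu, X)} \delta(\nu', \alpha).$$

Transition sequences on $\epsilon^*$ correspond to the closure on item sets.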

2.5 Conflict-Root Nodes 

Let be the set of DR(0) conflict nodes, for which the left contexts indicated by 
the nodes (i.e., the leftpart nonterminal plus the rightpart portion to the left of 
the dot) are explicitly compatible with more than one parsing action, excluding 
the nodes which are found before a rightpart can be fully verified. The subset 
contains the conflict-root nodes, i.e., the “initial” nodes of , where conflicts 
are found. Hence, 

= {v \ 3p, V G 5([z, A-^aX.a']q, /5), 3[z, A-^aX.a']q, [j, A-^aX.a"]q, 
i yf j, $[k, B^pX.(3']q, Bj3 yf Aa, ^ [/i., A-^aiW X .a]q\ 

= {v I 3z^' Y) n 

Note that the definition implies a' yf e. 

3 Introduction to Noncanonical DR(0) Parsing 

In the NDR extension, in case of parsing conflict, actions are deferred, and a 
conflict-root mark is (virtually) pushed onto the parsing stack and a new shift 
is made. Then, the parser can resume its work as a normal DR parser (shifting, 
and reducing stack suffixes above the mark), until the mark position is reached 
during some stack suffix exploration. At that point, there are several possibilities, 
depending on the mark and the state in which it is found: 

1. The conflict can be resolved, because the suffix leading to the state in which 
the mark is found is compatible with the right contexts associated with 
only one of the actions in conflict for this mark. At this point, the parser 
resumes using the conflict-root mark position as its new (virtual) stack top 
and performs the action. 
2. The context associated with the mark allows an action to be decided. It is
performed (if it is not a reduction including the mark position) and normal
parsing continues.

3. The parser cannot continue because the mark does not allow a decision between
several actions, or because the reduction spans to the mark position.
An induced mark is pushed, and DR parsing resumes as previously.

4. The mark is incompatible with the state (because the input is not a sentence). 
Therefore, an error is signalled. 

Essential information regarding conflict contexts is computed at construction 
time. It is represented by graphs which can be deduced from the transitions 
from the initial state until the state in which the conflict is found. A graph is 
associated to each action in a DR conflict, and a mark is, at construction time, 
a set of graph nodes coding some covering of their corresponding right contexts 
to follow. 

The critical issue is the induced conflicts of point 3 above. An induced
conflict can involve derivation subtrees, and the construction must leave some 
paths in the graph of the previous mark to follow the induced paths, and then 
resume the previous paths. In order to have optimal parsing power, a new graph 
must be connected to the node representing the corresponding mark position. 
Since this may happen an unbounded number of times, such a construction 
is, in principle, unfeasible, unless some form of looping construction is devised 
preserving the difference amongst the possible paths. 



3.1 NDR(0) Parsing for an Example Non-LRR Language

In order to illustrate NDR(0) parsing, here we show by example the acceptance
of a non-LRR (hence nondeterministic) language. Consider the grammar $G_1$ with
productions

$$P'_1 = \{S' \to \vdash S \dashv,\ S \to AC,\ S \to BCc,\ A \to aAb,\ A \to d,\ B \to aBbb,\ B \to d,\ C \to aCc,\ C \to e\}$$

for the non-LRR² language
$L_1 = \{a^n d b^n a^p e c^p\} \cup \{a^n d b^{2n} a^p e c^{p+1}\}$, $n, p \ge 0$.
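As a sanity check on the language described by these productions, membership in $L_1$ can be tested directly by counting. The Python sketch below assumes the reading of the productions in which A derives $a^n d b^n$, B derives $a^n d b^{2n}$, and C derives $a^p e c^p$; it is a plain recogniser for illustration and has nothing to do with the NDR(0) parser itself. The function name is hypothetical.

```python
import re

def in_L1(word):
    """True if word has the form a^n d b^n a^p e c^p
    or a^n d b^(2n) a^p e c^(p+1), with n, p >= 0."""
    match = re.fullmatch(r"(a*)d(b*)(a*)e(c*)", word)
    if not match:
        return False
    n, nb, p, nc = (len(group) for group in match.groups())
    via_A = nb == n and nc == p              # S -> A C
    via_B = nb == 2 * n and nc == p + 1      # S -> B C c
    return via_A or via_B
```

Note that deciding between the two alternatives requires comparing the numbers of a's and c's after the b's, which is exactly the unbounded right context that a regular lookahead cannot handle.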

The NDR(0) parsing table for the example grammar is shown in Fig. 1, where
“g$i$” represents transition entries, “$i$” represents a parsing action (rule
number to reduce, or shift if $i = 0$), “m$i$” indicates to push that mark, and
“r$i$” indicates to resolve the rightmost DR conflict by applying action $i$.

Note that mark situations are computed at construction time, but only the marks
themselves are used, in coded form, from the parsing table, like any other stack
symbol.

Let us see, for instance, how parsing proceeds for a sentence in the first
sublanguage of $L_1$ with an even number of b's (for easier comprehension, marks
are shown as if they were actually inserted within the working sentential form).
$\models$ and $|$ represent a parsing step and the working top of stack,
respectively.

² Because a regular cover that always discriminates between the suffix languages
$b^n a^p e c^p$ and $b^{2n} a^p e c^{p+1}$ is impossible, for it would be necessary
to compare the (unbounded) number of a's and c's.

Fig. 1. NDR(0) parsing table for grammar $G_1$.

h H h d\b^'^ ed’ H |= h H 

1=^ h a^”dmi(6m2&mi)"ala^“^ec^ H h a^”dmi(6m2t»mi)"a^elc^ H 
1= h a^"dmi(6m2&mi)"a^ClcP H |^l- a^"iimi(&m2&mi)"a^“^C'lcP“^ H 
h a^”dmi(6m2&mi)"C'l H |= h a^”dmi(5m26TOi)”Cm2 H I 
1= h a^"Al mi( 6 m 26 TOi)"Cm 2 H \= h a^"A6l m 26 mi( 6 m 2 foTOi)"“^Cm 2 H 
1= h m26mi(6m25TOi)"“^(7TO2 H [=^ h Al miCm2 H 

1= LAC I m2 H 1=^ h S-H I. 

The parser initially performs a sequence of shifts until finding d, whose
possible reductions to A or B are in conflict, and thus conflict-root mark $m_1$
(i.e., the mark to resolve) is pushed and the next terminal b is shifted. The
following sequence of b's is shifted, with alternating marks $m_2$ and $m_1$ coding
its parity. Then $a^p e c^p$ is reduced³ to C without ever looking at the rightmost
mark $m_1$. Then a new induced mark $m_2$ is pushed⁴ and the $\dashv$ symbol
shifted, which allows the conflict-root mark to be resolved in favor of reduction
to A⁵. After reducing d to A, basically DR(0) parsing continues according to the
new “stack top” position, reducing $a^{2n} d b^{2n}$ to A. Then, a shift action
puts C on top of the stack, and after some more steps parsing successfully ends.

The reader can easily check that the number of parsing actions and of stack
explorations, including those regarding the conflict, is linear in the input length.



³ This is the critical “counting” phase that is impossible for a regular covering.
Note that there is neither special forward exploration nor backtracking, just
natural locally-canonical DR(0) parsing.

⁴ In case of an odd number of b's, the rightmost mark would be $m_2$, which would
already resolve the reduction to A without shifting $\dashv$.

⁵ Or, alternatively, to B if a c had been read instead of $\dashv$, for some
sentential form $\vdash a^n d b^{2n} C c \dashv$; note that the parser did not
“count” the number of a's.
4 The Basic Looping Construction

In ^ a nonlooping solution is described, which directly uses the DR automaton 
underlying graph of items. We present here a more powerful, looping construction 
that computes more precise right contexts. Nevertheless, it is suboptimal since, 
in some cases, it merges paths of a same graph. 

For this looping solution, contexts are coded by mirror copies of subgraphs 
of the DR automaton underlying graph. The nodes of these copies will be called 
right nodes, while DR automaton underlying graph nodes will be called left 
nodes. A distinct graph is built for each DR conflict. Each such DR conflict and 
graph can be uniquely identified by the pair (q,X), i.e., the DR conflict found 
at state q on symbol X. 



4.1 Right Nodes and Transitions 



A right node $\mu = [j, A \to \alpha \bullet \beta, \nu_a, \nu_t]_{qX}$ belongs to
the right context graph for action $j$ in conflict $(q, X)$. The core
$A \to \alpha \bullet \beta$ represents a potential mark situation where the dot
indicates the stack top position at the moment of pushing a mark. Left node $\nu_a$
is the reference node of $\mu$. Transitions for right nodes are guided by
transition paths from $\nu_a$ towards left node $\nu_t$. Accordingly, we define the
following right-node transition function:



witl0 



A^a.P, Va, Vt]qX, T) = 

{ 0([j, A^a.p, Va, vt]qx) if /3 = £ and y = e 
{[j, A^aY.(3', Va, vt]qx} A l3 = Yf3' 

0 otherwise. 



( {[j,B^l^-y,l''a,’^t]qX I 

I '^'a= [i,B^-1.Ai]q G 5{Va,eai), 3p,VtG S{v'a,ep)} AVay^V 
{[j, B^jA.y ,v,v]qx \ B^jAy £ P'} otherwise. 

In right nodes, the dot moves from left to right, and when the right end of the 
rightpart is reached, it ascends in the parsing tree. Note the symmetry between 
the dot ascents in left and right context graphs: 

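$[j, A \to \alpha B \cdot \beta, \nu'_a, \nu_t]_{qX} \in \theta([j, B \to \varphi \cdot, \nu_a, \nu_t]_{qX}, \varepsilon)$ implies 
$\nu'_a = [i, A \to \alpha \cdot B\beta]_{q'} \in \delta([i, B \to \cdot \varphi]_{q'}, \varepsilon)$. 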

Transitions (on $\varepsilon$) from right nodes with a reference node $\nu_a$ such that 
$\nu_t \in \delta(\nu_a, \varepsilon\alpha_1)$ will be described in the next sections. 



V is a distinguished virtual left node which indicates that all legal paths allowed by 
the grammar can be followed. 
We extend $\theta$ to strings in $V'^*$: 

$\theta(\mu, X\alpha) = \bigcup_{\mu' \in \theta(\mu, X)} \theta(\mu', \alpha)$. 

4.2 Equivalence and Connection of Right Subgraphs 

All right graphs whose nodes have the same $\nu_t$ are mirror copies of the 
same section of the left graph defined by transitions from nodes in $I_{q'}$ towards 
$\nu_t$. Hence, we use here a simple equivalence condition by considering that two 
right subgraphs are equivalent iff their nodes have the same $(j, \nu_t, q, X)$. 

The following procedure connects the set of “target” nodes $\mu$ of a right subgraph 
associated with some $(j, \nu_t, q, X)$ to a node $\mu_p$ of the right graph of the conflict 
$(q, X)$ for action $j$. 

procedure connect (iX(, /Xp) = 
for all /X = [j, H— such that 
Up = [j, A-^aB.f3, Va, i't]qx, 

= [i,A-^a.BI3]q^ G 5{v'^,£ipi),v'^ = [i, B^ip-i.ip2]q^ 
add /ip to 0(/x, e) 

Since two equivalent right subgraphs related to different left contexts are the 
same mirror copy of some left subgraph, their corresponding right contexts will 
be merged, and thus confused, because they share transitions built by connect. 
On the other hand, this property ensures looping on a right subgraph whose right 
context would infinitely reintroduce new mirror copies of the same subgraph in 
an unrestricted construction. 

4.3 ε-Skip Function on Right Context Node Sets 

We denote by $J_m$ the right node set for a mark $m$. The ε-skip function $\Theta_0$ on mark 
node sets is defined as follows. 

^o(^m) — IfX — [//, A yoX »(X , ZXq , qX G i?(/i,/x), /X G Jrm 
/5 =>■* e,3[Q,A-^aY.a'\qq\ 

U {Mo I fxo = [j,C^-^YX ,vo,v[]qx,VQ = [0,C-)>7r.7']go, 

= [0, A-)>a.a']qj,/Xc = [j, A-^a.a' ,Ua,^'t]qX G 6»(/x,p), 

H G Jm, G PP' =^* e}- 

The objective of this function is to skip the nonterminals that can derive $\varepsilon$ and 
were not reduced (because of the conflict that gave rise to the mark), which the DR 
automaton evidently cannot find during its exploration of some working-stack 
suffix. Its first subset simply ascends on the right through $\theta$ until a shift is 
possible; no connect is necessary in this case, and $\Theta_0$ simply follows $\theta$. 

In its second subset, a connect is necessary, because it corresponds to the 
case where the DR automaton ascends in the parsing tree by $\delta$ through the 
(nonexistent) stack suffix deriving $\varepsilon$, and finally also decides a shift action; 
so, in all cases, the next parsing action after pushing a mark is to shift. In 
order to resume the paths on the previous subgraphs, connections are 
performed (for each $\mu_c$) by calls to connect($\nu'_t$, $\mu_p$), such that $\{\mu_p\} =$ 

Note that it is possible to use a more precise equivalence condition which 
distinguishes whether a subgraph is associated with a closure, since such subgraphs have 
in general a more restricted left context (deriving $\varepsilon$). 

4.4 Transitions on a Mark 

The computation of the induced conflict mark situations has the form of a tran- 
sition function on the state q' finding the mark, as follows. 

6{Jm,q') = 

^o({M I M [j) ^ ^ 

U{mo I Mo = [j,D^ipY.ip\vQ,v[]qx,vo = 

t^c — [j? ^ ^ l/i^qX € Jm^ 

3p, [i,A-^a.fiB-f\q, e 5{v[,(3) C 5{v(^,ep)}) 

The first subset simply follows $\theta$ when the mark node is found during the 
exploration of a rightpart section from the initial state. On the other hand, 
for the second subset, as in $\Theta_0$, when the DR automaton ascends in the trees 
through $\delta$, “context switches” are performed, and the subgraphs are connected 
by calls to connect($\nu'_t$, $\mu_p$), such that $\{\mu_p\} = \theta(\mu_c, \rho B)$. 

Each different node set is associated with a different mark, and thus $J_m = J_{m'}$ 
implies $m = m'$. 

In general, for some mark $m$, not every possible mark $m'$ will be produced 
at parsing time for every $q$ such that $J_{m'} = \Theta(J_m, q) \ne \emptyset$. Such marks are 
useless, and their inclusion in the final parser could be avoided with a more 
precise generation procedure. Note that a mark set $J_{m'} = \Theta(J_m, q)$ is useless if 
some $m_h$ is always found at parsing time before reaching $m$ at $q$, and thus $m'$ 
is never pushed. But, because of the equivalence condition, the context of $J_{m'}$ is 
never less precise than the context of any such $J_{m_h}$. Therefore, the inclusion 
of useless marks does not affect the grammar class that the parser generator 
accepts. 

4.5 DR(0)-Conflict Mark Node Sets 

Let be the conflict-root mark associated with some DR(0) conflict (q,X). 
Its associated set of right nodes is the following: 

= 'B{JqX,q) 

(which involves the calls to connect associated with $\Theta$), where $J_{qX}$ is the set 
associated with a virtual mark and is defined as follows: 

JqX = {[j,A^aX.(3,t),C']qx I [j, A^aX./3]q G Iq}- 
From the definition of $I_q$, the left contexts compatible with the actions in the 
conflict are the same. Thus, the graphs for actions in conflict are connected to 
a virtual right graph whose transitions follow all the upwards paths allowed by 
the grammar. 

4.6 Inadequacy Condition 

A grammar G is inadequate iff 

, A-^a' •j3,v'g^,v[\qX G Jm, 
j ^ /. («"a. ^t) G Vt)}- 

Since, from the node with v, all legal paths allowed by the grammar can be 
followed, there is a path in each respective right graph of each action which 
follows the same transitions. Consequently, if the condition holds, there are some 
sentences for which the parser cannot discriminate between both such actions. 

5 Algorithms 

Here we present the NDR(0) parser-generation and parsing algorithms. The former 
produces the deterministic parsing table for the acceptable class of grammars 
that the latter uses during parsing. If no mark is computed, the grammar 
is LR(0) and the resulting table corresponds to a DR(0) parser, which may be 
considered a particular case of NDR(0). 



5.1 NDR(0) Parser Generation 

The generation algorithm iteratively generates automaton states and conflict 
marks. New states are created while an action cannot be discriminated, or if 
rightparts to reduce have not yet been fully explored. When conflict items are 
detected the corresponding conflict-root mark is pushed. In the mark section, 
actions to perform at the corresponding state when finding a mark are computed. 

Inadequate grammars are detected according to the condition of the previous 
section (not shown). If the grammar is adequate, a deterministic parsing table 
Action is generated. 

NDR(0) GENERATOR: 

Q := {qo}; M := 0 

repeat 
for q G Q do 
for X gV do 

QS := {{Aa,i) \ 3[i, A-G-aX.a']q\ 
if QS = 0 then Action{q, X) := error 
else if 3i' ,y{Aa,i) G QS,i = i' and $[i., C^jWX.j']q then 
Action{q, X) := i 
else if 3{Aa, i), (Bf3,j) G QS, Aa ^ B/3 or 3[i., then 

Iq' := A{Iq,X); add q' to Q; Action{q, X) := goto q' 
else Jm = J qx\ add Jm to M: Actioniq.X) := push m 
until no new q 
repeat 

for q' gQ - {qo} do 

for m G M do 

MS := {{j,i) I [j,A^a.f3,iya,iyt]qX G A^a.p]q>} 
if MS = 0 then Action{q',m) := error 

else if 3j'y{j,i) G MS,j = j' then Action{q',m) := resolve j' 
else if 3i',V(j, i) G MS,i = i' and $ [i'., B^ipX.^p]qi then 
Action{q' , m) := i' 

else Jm' := 0{Jm,q')', add m' to M; Action{q',m) := push m' 
until no new m 



5.2 The NDR(0) Parser 

In the following algorithm, a working list l contains the current symbol sequence 
deriving the input prefix read thus far. At any time, list positions can be indexed 
from 1 to the current length e. Procedure ins(X, p) inserts symbol X at position 
p in the list, and del(h, p) removes the symbols at positions p, ..., p + h − 1. The 
parser uses pointer p to explore down the list from the current “top” t. 
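
As an illustration only, the two list primitives can be sketched in Python. The class name `WorkingList`, the method name `delete` (Python reserves `del`) and the helper `reduce_at` are assumptions of this sketch, not names taken from the algorithm; positions are 1-based as in the text.

```python
class WorkingList:
    """1-indexed symbol list supporting the ins/del primitives described above."""

    def __init__(self, symbols=()):
        self.items = list(symbols)

    def __len__(self):
        # Current length of the list (written e in the text).
        return len(self.items)

    def __getitem__(self, p):
        # Symbol at 1-based position p.
        return self.items[p - 1]

    def ins(self, symbol, p):
        # Insert `symbol` so that it ends up at position p.
        self.items.insert(p - 1, symbol)

    def delete(self, h, p):
        # Remove the h symbols at positions p, ..., p + h - 1.
        del self.items[p - 1:p - 1 + h]


def reduce_at(lst, t, lhs, rhs_len):
    """Hypothetical helper: t := t - |alpha| + 1; del(|alpha|, t); ins(A, t)."""
    t = t - rhs_len + 1
    lst.delete(rhs_len, t)
    lst.ins(lhs, t)
    return t
```

With the list `a b c` and top `t = 3`, reducing by a rule with a two-symbol rightpart replaces `b c` by the left-hand side and leaves the top at position 2.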

Since sequences of simultaneous unresolved DR conflicts are resolved in 
reverse order, we use a mark stack s. It contains one element per DR-conflict-root 
mark, with three components: the conflict-root mark position (and thus 
next resolution top) r, its corresponding rightmost mark position p (initially 
equal to r), and the rightmost mark value m. The topmost element, indexed by c, 
corresponds to the rightmost, current DR conflict. 

NDR(0) PARSER: 
procedure shred(i) = case i of 
    shift : if t = e then read(a); ins(a, e + 1) 
            t := t + 1 
    reduce(A→α) : t := t − |α| + 1; del(|α|, t); ins(A, t) 
c := 1; s[c] := (0, 0, 0) 
ins(⊢, e + 1); t := e 
repeat 
    q := q0; p := t 
    repeat 
        if s[c].p ≠ p then act := Action(q, l[p]) 
        else act := Action(q, s[c].m) 
        case act of 
            goto q' : q := q'; p := p − 1 
            shift, reduce : shred(act) 
            push m' : if s[c].p ≠ p then c := c + 1; s[c].r := t 
                      s[c].p := t; s[c].m := m'; shred(shift) 
            resolve i : t := s[c].r; c := c − 1; shred(i) 
    until act ≠ goto 
until act ∈ {accept, error} 

6 Conclusions 

The basic construction of noncanonical discriminating-reverse parsers has been 
presented. It uses a mechanism of marks that allows locally canonical DR parsing 
to resume, while conflicts are resolved once sufficient right-hand context has 
been processed. In general, this extension model largely increases the parsing 
power, reaching non-LR(k) grammars while still producing deterministic parsers. 
It should thus be applicable to problems where ambiguity or nondeterminism 
during parsing is hardly acceptable, such as programming language processing. 

We conjecture that the optimal, finite construction without enlarging the 
basic DR(0) automaton will make it possible to parse at least every LALR(k) grammar, 
for any k (since in those grammars the left context information is condensed 
in a basically direct LR(0) parser, and the corresponding lookahead window is 
correlated to those states), as well as some class of grammars of infinite lookahead. 
Such an optimal solution is under study. Moreover, we have shown that, since the 
NDR parser retains all its context-free parsing power during the mark resolution 
phase, it is possible to accept non-LRR grammars and languages. 

The likely straightforward extension to using DR(k) as the base automaton 
would evidently accept all LR(k) grammars with a relatively small increase in 
the number of states. Moreover, it would in general differentiate more left context 
in the automaton nodes, and thus the class of acceptable grammars would be 
further enlarged. 
Abstract. In this paper, we give an automata theoretic version of several 
algorithms dealing with profinite topologies. The profinite topology 
was first introduced for the free group by M. Hall, Jr. and by Reutenauer 
for the free monoid. It is the initial topology defined by all the monoid 
morphisms from the free monoid into a discrete finite group. For a vari- 
ety of finite groups V, the pro-V topology is defined in the same way by 
replacing “group” by “group in V” in the definition. Recently, by a geo- 
metric approach, Steinberg developed an efficient algorithm to compute 
the closure, for some pro-V topologies (including the profinite one), of 
a rational language given by a finite automaton. In this paper we show 
that these algorithms can be obtained by an automata theoretic approach 
by using a result of Pin and Reutenauer. We also analyze precisely the 
complexity of these algorithms. 



1 Preliminaries 

For more information on automata and language theory we refer the reader to 
P17| . For a general reference on the theory of groups, the reader is referred to 
ini . and for basic results and definitions on graphs see |^. 



1.1 Introduction 

The profinite topology is used to characterize certain classes of rational lan- 
guages: the languages of level 1/2 in the group hierarchy and the languages 
recognizable by reversible automata |llll4j . Moreover pro-V topologies on the 
free group or on the free monoid play a crucial role in the theory of finite semi- 
groups Enm. In particular, several important decidability problems, related 
to the Malcev product, reduce to the computation of the closure of a rational 
language in the profinite topology. 

Pin and Reutenauer, and Ribes and Zalesskii, developed an 
algorithm to compute the closure of a rational language given by a rational 
expression. Recently, by a geometric approach, Steinberg proposed an algorithm 
to compute the closure of a rational language given by a finite state 
automaton. 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 1.85- IT3fl 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 
This paper provides automata theoretic arguments to derive Steinberg’s algorithm 
from Pin and Reutenauer’s. Furthermore, we study several algorithmic 
consequences of this approach. 

In the first part of this paper, we introduce some notations and definitions. In 
the second one, we give an algorithm to compute the subgroup of the free group 
generated by a rational language and we obtain an automata theoretic proof of 
Steinberg’s algorithm. In the third part, we apply this algorithm to languages 
of the free monoid, and we show that testing whether a word belongs to the 
interior of a given rational language can be done in polynomial time. We also 
give a polynomial time algorithm to test whether a language is closed. In the 
fourth part we do a precise analysis of the algorithm in the profinite case. We 
deduce a new algorithm to test whether a deterministic automaton recognizes 
a closed language in the profinite topology (see [14,16]). Finally we give a new 
proof of a proposition on reversible automata (Proposition 10), and we also give 
a new characterization of closed rational subsets of the free group (Proposition 
11). 

1.2 Languages and Words 

We denote by $|K|$ the cardinality of a set $K$. If $\varphi$ is a function from a set $X$ into 
a set $Y$ and $x \in X$, we denote by $x\varphi$ the image of $x$ by $\varphi$. 

Let $L$ be a language of $A^*$; we denote by $L^c$ the complement of $L$ in $A^*$. 

Let $A$ be a finite alphabet and let $\bar{A} = \{\bar{a} \mid a \in A\}$ be a copy of $A$. Finally, 
let $\tilde{A}$ be the disjoint union of $A$ and $\bar{A}$. The map $a \mapsto \bar{a}$ from $A$ onto $\bar{A}$ can be 
extended to a one-to-one function from $\tilde{A}$ into itself by setting $\bar{\bar{a}} = a$. A word 
of $\tilde{A}^*$ is said to be reduced if it does not contain any factor of the form $a\bar{a}$ with 
$a \in \tilde{A}$. We denote by $D(A)$ the set of all reduced words of $\tilde{A}^*$ and by $\equiv$ the 
monoid congruence generated by the relations $a\bar{a} \equiv 1$ for all $a \in \tilde{A}$. We write 
$F(A)$ for the set $\tilde{A}^*/\!\equiv$, which is a group for the quotient law, called the free group 
over $A$. Let $\pi$ be the projection from $\tilde{A}^*$ onto $F(A)$, which is a monoid morphism. 
Note that the restriction of $\pi$ to $D(A)$ is one-to-one. For each element $u \in \tilde{A}^*$ 
there exists one and only one reduced word $v = \mathrm{red}(u)$ such that $u\pi = v\pi$. 
Since every word of $A^*$ is reduced we can now identify $u \in A^*$ with $u\pi$, thereby 
considering $A^*$ as a subset of $F(A)$. 
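
The reduction map can be sketched with a stack. This is a minimal illustration under an assumed encoding that is not part of the paper: a letter of the alphabet is a lowercase character and its formal inverse is the corresponding uppercase character, so the hypothetical helper `bar` is just a case swap.

```python
def bar(x: str) -> str:
    """Formal inverse of a letter under the assumed encoding: 'a' <-> 'A'."""
    return x.swapcase()


def red(word: str) -> str:
    """Return the unique reduced word equivalent to `word` in the free group."""
    stack = []
    for x in word:
        if stack and stack[-1] == bar(x):
            stack.pop()        # cancel a factor of the form (letter, inverse)
        else:
            stack.append(x)
    return "".join(stack)
```

Each letter is pushed and popped at most once, so the reduction takes time linear in the length of the word.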

1.3 Automata 

Recall that a finite automaton is a 5-tuple $\mathcal{A} = (Q, B, E, I, F)$ where $Q$ is a finite 
set of states, $B$ is the alphabet, $E \subseteq Q \times B \times Q$ is the set of edges (or transitions), 
$I \subseteq Q$ is the set of initial states and $F \subseteq Q$ is the set of final states. A finite 
automaton $\mathcal{A} = (Q, B, E, I, F)$ is said to be an $(n, m)$-automaton if $|Q| = n$ and 
$|E| = m$. The following relation always holds for an $(n, m)$-automaton: 

$m \le |B| n^2$ and hence $m = O(n^2)$. 

A path in $\mathcal{A}$ is a finite sequence of consecutive edges: 

$p = (q_0, a_1, q_1), (q_1, a_2, q_2), \ldots, (q_{n-1}, a_n, q_n)$. 

The label of the path $p$, denoted by $[p]$, is the word $a_1 a_2 \cdots a_n$, its origin is $q_0$ 
and its end is $q_n$. A word is accepted by $\mathcal{A}$ if it is the label of a path in $\mathcal{A}$ having 
its origin in $I$ and its end in $F$. Such a path is said to be successful. The set of 
words accepted by $\mathcal{A}$ is denoted by $L(\mathcal{A})$. 

For every state $q$ and language $K$ we denote by $q._{\mathcal{A}}K$ (or $q.K$ if there is no 
ambiguity on $\mathcal{A}$) the subset of $Q$ of all the states which are the end of a path 
having its origin in $q$ and its label in $K$. An automaton is said to be trim if for 
each state $q$ there exists a path from an initial state to $q$ and a path from $q$ 
to a final state. It is called complete if for each state $p$ and each letter $a$ of the 
alphabet there exists a transition which starts from $p$ and which is labeled by 
$a$. An automaton is deterministic if it has a unique initial state and does not 
contain any pair of edges of the form $(q, a, q_1)$ and $(q, a, q_2)$ with $q_1 \ne q_2$. Let us 
note that for a deterministic $(n, m)$-automaton we have 

$m \le |B| n$ and hence $m = O(n)$. 

An important result of automata theory states that for any automaton $\mathcal{A}$ there 
exists exactly one deterministic automaton (up to isomorphism) with a minimal 
number of states which accepts the same language. It is called the minimal 
automaton of $L(\mathcal{A})$. An automaton is connected if any two states can be joined 
by a path. An $\varepsilon$-automaton is an automaton in which the transitions are in 
$Q \times (B \cup \{\varepsilon\}) \times Q$. 

A subset $L$ of $F(A)$ is said to be accepted by an automaton $\mathcal{A}$ on $\tilde{A}$ if 
$L(\mathcal{A})\pi = L$. 

An automaton on $\tilde{A}$ is dual if for every transition $(p, a, q)$, the triplet $(q, \bar{a}, p)$ 
is also a transition. An inverse automaton is a connected dual automaton on $\tilde{A}$ 
in which each letter induces a partial one-to-one function from the set of states 
into itself. We have: 

Proposition 1. Let $H$ be a subset of $F(A)$. The following propositions are 
equivalent: 

(i) $H$ is a finitely generated subgroup of $F(A)$. 

(ii) There exists an inverse automaton $\mathcal{A}$ on $\tilde{A}$ with a unique initial state which 
is also the unique final state such that $L(\mathcal{A})\pi = H$. 

(iii) There exists a dual automaton $\mathcal{A}$ on $\tilde{A}$ with a unique initial state which 
is also the unique final state such that $L(\mathcal{A})\pi = H$. 

We define a subgroup automaton as an inverse automaton with a unique initial 
state which is also the unique final one. The following result describes 
an algorithm called ReduceToInverse to compute a subgroup automaton from 
a dual automaton. 

Proposition 2. Let $\mathcal{A} = (Q, \tilde{A}, E, \{i\}, \{i\})$ be a dual $(n, m)$-automaton. We 
can construct in time O(mn + n^) a subgroup automaton $\mathcal{A}_0$ on $\tilde{A}$ such that 
$L(\mathcal{A})\pi = L(\mathcal{A}_0)\pi$. Moreover $\mathcal{A}_0$ has no more states and transitions than $\mathcal{A}$. 
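
The text does not spell out ReduceToInverse. The sketch below uses Stallings-style folding, which is an assumption about one standard way to make a dual automaton deterministic (and hence, if it is connected, inverse); it is not claimed to be the procedure of Proposition 2 nor to meet its time bound. The function name `fold` and the lowercase/uppercase encoding of a letter and its inverse are hypothetical.

```python
def fold(edges, start):
    """Merge states until no state has two outgoing edges with the same label.

    `edges` is a list of (p, a, q) triples of a dual automaton; `start` is the
    unique initial and final state. Returns (folded_edges, new_start).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    changed = True
    while changed:
        changed = False
        target = {}
        for p, a, q in edges:
            key, q = (find(p), a), find(q)
            if key in target and find(target[key]) != q:
                parent[q] = find(target[key])   # fold the two targets together
                changed = True
            else:
                target[key] = q

    folded = {(find(p), a, find(q)) for p, a, q in edges}
    return folded, find(start)
```

Because the input is dual, removing outgoing nondeterminism on every letter of the doubled alphabet also removes incoming nondeterminism, so every letter ends up acting as a partial one-to-one map on the merged states.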
1.4 Automata and Graphs 

Let $\mathcal{A}$ be an $(n, m)$-automaton with set of states $Q$ and set of transitions $E$. A 
subset $P$ of $Q$ is said to be strongly connected if, for each pair $p$ and $q$ of states in 
$P$, there exist a path from $p$ to $q$ and a path from $q$ to $p$. A strongly connected 
component of $\mathcal{A}$ is a maximal (for inclusion) set of states which is strongly 
connected. The strongly connected components of $\mathcal{A}$ form a partition of $Q$. A 
transition $(p, a, q)$ of $\mathcal{A}$ is internal to a strongly connected component if $p$ and 
$q$ belong to the same strongly connected component. It is said to be internal if it is 
internal to some strongly connected component and external otherwise. One can 
compute in time $O(m + n)$ (see |S|) the strongly connected components of $\mathcal{A}$. 
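
One way to obtain the components and the internal transitions in linear time is the classical two-pass (Kosaraju) method, sketched below. The choice of this particular algorithm and the function names are assumptions of the sketch; the text only cites a reference for the bound.

```python
from collections import defaultdict


def strongly_connected_components(states, edges):
    """Map every state to a representative of its strongly connected component."""
    succ, pred = defaultdict(list), defaultdict(list)
    for p, _a, q in edges:
        succ[p].append(q)
        pred[q].append(p)

    # First pass: record states in order of depth-first completion.
    order, seen = [], set()
    for s in states:
        if s in seen:
            continue
        seen.add(s)
        stack = [(s, iter(succ[s]))]
        while stack:
            node, it = stack[-1]
            nxt = next((t for t in it if t not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(succ[nxt])))

    # Second pass: explore the reversed graph in reverse completion order.
    comp = {}
    for s in reversed(order):
        if s in comp:
            continue
        comp[s] = s
        todo = [s]
        while todo:
            node = todo.pop()
            for t in pred[node]:
                if t not in comp:
                    comp[t] = s
                    todo.append(t)
    return comp


def internal_transitions(states, edges):
    """Transitions whose two endpoints lie in the same component."""
    comp = strongly_connected_components(states, edges)
    return [(p, a, q) for p, a, q in edges if comp[p] == comp[q]]
```

The sketch assumes that no state is the Python value `None`, which is used internally as a sentinel.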

A spanning tree of an $(n, m)$-automaton $\mathcal{A} = (Q, B, E, I, F)$ is a spanning 
tree of the graph $(Q, E)$ (see [S| for the definition of a spanning tree). It can be 
computed in time $O(m + n)$. Given an $(n, m)$-automaton $\mathcal{A}$ and a word $u$, one can 
test whether $u$ belongs to $L(\mathcal{A})$ in time $O((m + n)|u|)$. On the other hand, testing 
whether $L(\mathcal{A})$ is empty only requires time $O(m + n)$. 
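
Both tests can be sketched as follows, assuming an automaton without ε-transitions given as a collection of `(p, a, q)` triples; the function names are hypothetical. Membership simulates the set of reachable states letter by letter, and emptiness is a plain graph search from the initial states.

```python
from collections import defaultdict


def accepts(edges, initial, final, word):
    """Subset simulation of a nondeterministic automaton on `word`."""
    succ = defaultdict(set)
    for p, a, q in edges:
        succ[(p, a)].add(q)
    current = set(initial)
    for letter in word:
        current = {q for p in current for q in succ.get((p, letter), ())}
    return not current.isdisjoint(final)


def is_empty(edges, initial, final):
    """The language is empty iff no final state is reachable from an initial one."""
    succ = defaultdict(list)
    for p, _a, q in edges:
        succ[p].append(q)
    seen, todo = set(initial), list(initial)
    while todo:
        p = todo.pop()
        for q in succ.get(p, ()):
            if q not in seen:
                seen.add(q)
                todo.append(q)
    return seen.isdisjoint(final)
```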

1.5 Rational Languages 

The class of rational languages of $M$ (where $M$ is a monoid) is the smallest class 
of languages containing the finite languages and closed under product, finite union 
and star operation. It is well 
known that a language of $A^*$ is rational if and only if it can be accepted by a 
finite automaton. 

An analogous result holds in the free group, by a result of Benois [2]. 

Theorem 1. A subset of $F(A)$ is rational if and only if it can be accepted by a 
finite automaton. 

1.6 Pro-V Topologies 

A variety of finite groups is a class of finite groups closed under taking subgroups, 
homomorphic images and finite direct products. Important examples are $\mathbf{G}$, the 
variety of all finite groups; $\mathbf{G}_p$, the variety of all finite $p$-groups (where $p$ is a 
prime number); $\mathbf{G}_{nil}$ and $\mathbf{G}_{sol}$, the varieties respectively of all finite nilpotent 
groups and all finite solvable groups; $\mathbf{G}_{com}$, the variety of all commutative finite 
groups. 

We say that two elements $x$ and $y$ of $F(A)$ can be separated by a group $V$ if 
there exists a group morphism $\varphi : F(A) \to V$ such that $x\varphi \ne y\varphi$. 

If for each group $G$ and each normal subgroup $H$ of $G$ such that $H \in \mathbf{V}$ 
and $G/H \in \mathbf{V}$, we have $G \in \mathbf{V}$, then we say that $\mathbf{V}$ is extension closed. This 
is the case if $\mathbf{V} = \mathbf{G}$, $\mathbf{G}_p$ or $\mathbf{G}_{sol}$, but not if $\mathbf{V} = \mathbf{G}_{com}$ or $\mathbf{G}_{nil}$. If $\mathbf{V}$ is 
extension closed and non trivial, then every pair of distinct elements of $F(A)$ 
can be separated by an element of $\mathbf{V}$. Thus we define 

$r_{\mathbf{V}}(x, y) = \min\{|V| \mid V \in \mathbf{V},\ V \text{ separates } x \text{ and } y\}$ 

and 

$d_{\mathbf{V}}(x, y) = 2^{-r_{\mathbf{V}}(x, y)}$. 

One can verify that $d_{\mathbf{V}}$ is a distance on $F(A)$ which defines the pro-V topology 
on $F(A)$. If $\mathbf{V} = \mathbf{G}$ we say profinite instead of pro-G. If $K$ is a subset of 
$F(A)$ we denote by $Cl_{\mathbf{V}}(K)$ the closure of $K$ for the pro-V topology. 

Ribes and Zalesskii proved that if $\mathbf{V}$ is extension closed, we have 

Proposition 3. One can compute the closure of a rational subset of 
$F(A)$ by the following algorithm: 

(1) $Cl_{\mathbf{V}}(S) = S$ if $S$ is finite. 

(2) $Cl_{\mathbf{V}}(S_1 \cup S_2) = Cl_{\mathbf{V}}(S_1) \cup Cl_{\mathbf{V}}(S_2)$. 

(3) $Cl_{\mathbf{V}}(S_1 S_2) = Cl_{\mathbf{V}}(S_1)\, Cl_{\mathbf{V}}(S_2)$. 

(4) $Cl_{\mathbf{V}}(S^*) = Cl_{\mathbf{V}}(\langle S \rangle)$, where $\langle S \rangle$ is the subgroup of $F(A)$ generated by $S$. 

Moreover if $H$ is a finitely generated subgroup of $F(A)$ then $Cl_{\mathbf{V}}(H)$ is also a 
finitely generated subgroup of $F(A)$. 

We also can define the pro-V topology on A* as the trace on A* of the pro-V 
topology on F{A) (see [11]). So, to compute the closure in A* of a rational set 
of A* we just have to compute it in F{A) and intersect the result with A*. 

In this paper we only consider extension closed varieties of groups. Moreover 
we assume that if $H$ is a finitely generated subgroup of $F(A)$ given by 
a subgroup automaton $\mathcal{A} = (Q, \tilde{A}, E, \{i\}, \{i\})$, we can compute a subgroup 
$(n, m)$-automaton $\mathcal{A}_0$ such that $Cl_{\mathbf{V}}(H) = L(\mathcal{A}_0)\pi$. We call this algorithm 
ClosureGroup and denote its time complexity by $f(n, m)$. Let us remark that a 
subgroup $(n, m)$-automaton satisfies $m \le |\tilde{A}| n$, and thus one can write $f(n)$ for 
$f(n, m)$. Such algorithms are known for the profinite and pro-p topologies. 
Moreover we naturally assume that $f$ is an increasing function of $n$ and 
that $f(n_1 + n_2) \ge f(n_1) + f(n_2)$. 

2 Main Algorithms 

In this section we adapt the algorithm of Proposition 3 to rational languages 
given by an automaton. We first treat the case of a group generated by a rational 
language. 

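Proposition 4. Let $L$ be a non empty rational language on $\tilde{A}$ given by a trim 
$(n, m)$-automaton $\mathcal{A} = (Q, \tilde{A}, E, I, F)$. Let 

$E_0 = E \cup \{(q, \bar{a}, p) \mid (p, a, q) \in E\} \cup \{(p, \varepsilon, q) \mid p, q \in F \cup I\}$. 

Then the $\varepsilon$-automaton $\mathcal{A}_0 = (Q, \tilde{A}, E_0, I \cup F, F \cup I)$ satisfies 

$\langle L(\mathcal{A})\pi \rangle = L(\mathcal{A}_0)\pi$. 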
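
The construction of Proposition 4 is a direct set computation. The sketch below assumes, for illustration only, that the inverse of a letter is obtained by swapping its case and that ε is represented by the empty string; the function name is hypothetical.

```python
EPS = ""  # assumed representation of the empty word


def bar(x: str) -> str:
    """Formal inverse under the assumed encoding: 'a' <-> 'A'."""
    return x.swapcase()


def subgroup_eps_automaton(states, edges, initial, final):
    """Build (Q, E0, I ∪ F, I ∪ F) from a trim automaton (Q, E, I, F)."""
    merged = set(initial) | set(final)
    e0 = set(edges)
    e0 |= {(q, bar(a), p) for (p, a, q) in edges}        # inverse edges
    e0 |= {(p, EPS, q) for p in merged for q in merged}  # ε-edges on I ∪ F
    return set(states), e0, merged, merged
```

The ε-edges make all initial and final states interchangeable, which is what lets successful paths be concatenated and inverted freely.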



Proposition 5. Let $L$ be a rational language on $\tilde{A}$ given by a trim $(n, m)$-automaton 
$\mathcal{A} = (Q, \tilde{A}, E, I, F)$. One can compute in time O(mn + n^) an automaton 
$\mathcal{A}_0$ on $\tilde{A}$ such that $\langle L(\mathcal{A})\pi \rangle = L(\mathcal{A}_0)\pi$ and such that $\mathcal{A}_0$ is a subgroup 
automaton. 
Let us remark that, thanks to the above result, one can compute in polynomial 
time a finite set which generates the subgroup generated by $L(\mathcal{A})\pi$. It 
is also possible to compute such a set if $L$ is presented by a rational expression. 

We deduce from this a method to compute the closure of a language given 
by a strongly connected automaton. Let us first recall the following proposition: 

Proposition 6. Let $\mathcal{A} = (Q, \tilde{A}, E, I, F)$ be a trim dual automaton and for each 
$q \in Q$, let $H_q = L(\mathcal{A}^q)\pi$. The following propositions hold: 

(1) $H_q$ is a finitely generated subgroup of $F(A)$. 

(2) If $p \in q.u$ and $x = u\pi$ then $H_q = x H_p x^{-1}$. 

(3) For each $i \in I$ and $f \in F$, let $u_{if}$ be a word of $\tilde{A}^*$ such that $f \in i.u_{if}$. 
Then: 

$L = \bigcup_{i \in I,\, f \in F} (u_{if}\pi) H_f = \bigcup_{i \in I,\, f \in F} H_i (u_{if}\pi)$ (1) 

Now we claim 

Proposition 7. Let A be a strongly connected- automaton on A with a unique 
initial state i and a unique final state f. Let H = L{A’‘)tt and L = L(A)tt. For 
all u and v in A* such that f G i.u and i G f.v, we have 

{H) = {Livn)) = {L{un)) (2) 

and 

Clv{L) = Clv{H)y = Clv{H)x = Clv{{H))x = Clv{{H))y (3) 

Proposition 8. Let $\mathcal{A} = (Q, \tilde{A}, E, I, F)$ be a strongly connected 
automaton. One can compute in time O(f(n) + mn + n^) an automaton 
$\mathcal{A}_0$ on $\tilde{A}$ such that $L(\mathcal{A}_0)\pi = Cl_{\mathbf{V}}(L(\mathcal{A})\pi)$. 

Now we give an algorithm, called ClosureInF, to compute the closure of a 
rational language given by an automaton. 

Theorem 2. Let $\mathcal{A} = (Q, \tilde{A}, E, I, F)$ be an automaton. One can compute in 
time O(f(n) + mn + n^) an automaton $\mathcal{A}_0$ on $\tilde{A}$ such that $L(\mathcal{A}_0)\pi = Cl_{\mathbf{V}}(L(\mathcal{A})\pi)$. 
We call this algorithm ClosureInF. 

In particular pro-p and profinite closures of a rational set can be computed 
without computing a rational expression of it. 
3 Applications to Languages of A* 

We use the previous algorithms to compute the closure of a rational set of A* 
given by an automaton. We first need the following results. 

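Theorem 3. Let $\mathcal{A}$ be an $(n, m)$-automaton on $\tilde{A}$. One can compute in time 
O(n^) an automaton $\mathcal{A}_0$ on $\tilde{A}$ with $n$ states such that $L(\mathcal{A})\pi = L(\mathcal{A}_0)\pi$ and if 
$u \in L(\mathcal{A}_0)$ then $\mathrm{red}(u) \in L(\mathcal{A}_0)$. 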

By retaining only the edges labeled by letters of A we get an important algorithm, 
called ProjectionInA*. 

Corollary 1. Let $\mathcal{A}$ be an $(n, m)$-automaton on $\tilde{A}$. One can compute in time 
O(n^) an $n$-state automaton which recognizes $L(\mathcal{A})\pi \cap A^*$. 

We can now sketch the description of our closure algorithm, called 
ClosureInA*. 

Theorem 4. Let $\mathcal{A} = (Q, A, E, I, F)$ be an $(n, m)$-automaton. One can compute 
in time O(f(n)^) an automaton $\mathcal{A}_0$ on $A$ which recognizes the closure of $L(\mathcal{A})$ 
for the pro-V topology on $A^*$. 

If $\mathcal{A} = (Q, A, E, \{i\}, F)$ is a complete deterministic automaton then $L(\mathcal{A})^c$ 
is also a rational language, accepted by the complete deterministic automaton 
$(Q, A, E, \{i\}, Q \setminus F)$. Thus it is easy to take the complement of a rational language 
of $A^*$ given by a deterministic automaton. From this we deduce: 
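
The complementation step amounts to swapping final and non-final states. A minimal sketch, assuming the transition function of the complete deterministic automaton is stored as a dictionary keyed by `(state, letter)`; both function names are hypothetical.

```python
def run(delta, start, word):
    """State reached from `start` on `word` in a complete deterministic automaton."""
    state = start
    for letter in word:
        state = delta[(state, letter)]
    return state


def complement(states, delta, start, final):
    """Same automaton with final set Q \\ F; correct only if it is complete."""
    return states, delta, start, set(states) - set(final)
```

Completeness matters here: if some transition were missing, a word that falls off the automaton would be rejected both before and after the swap.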

Proposition 9. Let $\mathcal{A} = (Q, A, E, \{i\}, F)$ be a complete deterministic automaton 
and $u$ a word on $A$. One can test in time O(f(n)^ + f(n)|u|) if $u$ belongs to 
the interior of $L(\mathcal{A})$ for the pro-V topology on $A^*$. 

We again use the ease of taking the complement of a rational language of $A^*$ 
given by a deterministic automaton. It is also easy to compute the intersection 
of two languages given by automata [7]. Thanks to this observation we have: 

Theorem 5. Let $\mathcal{A} = (Q, A, E, \{i\}, F)$ be a deterministic complete 
$(n, m)$-automaton. One can test in time O(n^ + (n + m)f(n)) whether $L(\mathcal{A})$ 
is closed for the pro-V topology on $A^*$. 

4 The Profinite Case 

In the case of V = G we obtain the profinite topology and the algorithms become 
much easier. Indeed we can compute the closure of a language recognized by a 
strongly connected automaton just by taking its dual. 
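
For the strongly connected case, the dualization step can be sketched as below. It only illustrates the operation of adding, for each transition, the reverse transition on the inverse letter; the case-swap encoding of inverse letters and the function names are assumptions of this sketch.

```python
def bar(x: str) -> str:
    """Formal inverse under the assumed encoding: 'a' <-> 'A'."""
    return x.swapcase()


def dual(edges):
    """Add (q, bar(a), p) for every transition (p, a, q)."""
    edges = set(edges)
    return edges | {(q, bar(a), p) for (p, a, q) in edges}


def is_dual(edges):
    """Check that the transition set is closed under reversal with inversion."""
    edges = set(edges)
    return all((q, bar(a), p) in edges for (p, a, q) in edges)
```

The operation visits each transition once, which is in line with a bound linear in the size of the automaton.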

Theorem 6. Let $\mathcal{A} = (Q, \tilde{A}, E, I, F)$ be an $(n, m)$-automaton on $\tilde{A}$. One can 
compute an automaton $\mathcal{A}_0$ such that $L(\mathcal{A}_0)\pi = Cl_{\mathbf{G}}(L(\mathcal{A})\pi)$ in time $O(m + n)$. 
We can also adapt it, as in the general case, to produce an algorithm to 
compute the profinite closure in $A^*$. 

Theorem 7. Let $\mathcal{A} = (Q, A, E, I, F)$ be an $(n, m)$-automaton on $A$. One can 
compute in time O(n^) an automaton $\mathcal{A}_0$ on $A$ with $n$ states which recognizes 
the closure of $L(\mathcal{A})$ for the profinite topology on $A^*$. 

We deduce from this a new algorithm to test whether a rational language of 
$A^*$ is closed for the profinite topology on $A^*$. As for Theorem 5 we have: 

Theorem 8. Let $\mathcal{A} = (Q, A, E, I, F)$ be a deterministic $(n, m)$-automaton. One 
can test in time O(n^) whether $L(\mathcal{A})$ is closed for the profinite topology on $A^*$. 

In [14,16], the algorithm to test whether a language is closed for the profinite 
topology is obtained by checking a property of the ordered syntactic monoid. In 
practice it requires to compute the transitive closure of a graph with vertices 
and thus our algorithm is more efficient. Moreover this algorithm is useful to test 
whether a language is of level 1/2 in the group hierarchy (see fl]) or whether 
it is recognizable by a reversible automaton (see El)- 

Moreover we obtain a new proof of the following proposition: 

Proposition 10. A language of $A^*$ which is accepted by a reversible automaton 
is closed for the profinite topology on $A^*$. 

As suggested in m. we also have a new algorithm to compute the kernel of 
a finite monoid, which seems to have the same complexity as that developed in 
| |1 . Let us first mention an important corollary of Theorem 3: 

Corollary 2. Let $\mathcal{A}$ be an $(n, m)$-automaton on $\tilde{A}$ whose set of states is 
$Q = \{q_1, \ldots, q_n\}$. One can compute in time O(n^) an $n \times n$ boolean matrix such that 
the $(i, j)$ entry is 1 if and only if there exists a path $p$ in $\mathcal{A}$ from $q_i$ to $q_j$ such 
that $[p]\pi = 1$. 



Theorem 9. Let M be a finite monoid generated by A. One can compute in 
time 0(|Mp) the kernel of M given by its Cayley graph. 

Finally we have a new characterization of closed rational languages of the 
free group. An automaton $\mathcal{A}$ on $\tilde{A}$ is said to be locally dual if for each internal 
transition $(p, a, q)$ of $\mathcal{A}$, $(q, \bar{a}, p)$ is also a transition of $\mathcal{A}$. 
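
The definition can be checked directly once the strongly connected components are known. In the sketch below, `component` is assumed to be a mapping from each state to an identifier of its strongly connected component, supplied by the caller; the function name and the case-swap encoding of inverse letters are also assumptions.

```python
def is_locally_dual(edges, component):
    """True iff every internal transition (p, a, q) has its reverse (q, bar(a), p)."""
    edges = set(edges)
    return all(
        (q, a.swapcase(), p) in edges
        for (p, a, q) in edges
        if component[p] == component[q]   # only internal transitions are constrained
    )
```

External transitions are left unconstrained, which is the only difference from the test for a dual automaton.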

We claim that 

Proposition 11. $L$ is a profinitely closed rational subset of $F(A)$ if and only 
if there exists a locally dual automaton $\mathcal{A}$ on $\tilde{A}$ such that $L = L(\mathcal{A})\pi$. 
5 Conclusion 

Our goal was to develop an algorithm to compute the closure of a rational 
language given by a finite state automaton. Using a result by Pin and Reutenauer 
we obtained such an efficient algorithm. However we do not know any lower 
bound for this problem. Furthermore our algorithm returns a nondeterministic 
automaton, and we do not know whether, for a rational language given by a 
deterministic automaton, one can compute a deterministic automaton accepting 
the closure of the language in polynomial time or space. 

The author would like to thank J.-E. Pin and P. Weil for fruitful comments. 
Abstract. We look at a model of a queue system M that consists of 
the following components: 

1. Two nondeterministic finite-state machines W and R, each aug- 
mented with finitely many reversal-bounded counters (thus, each 
counter can be incremented or decremented by 1 and tested for 
zero, but the number of alternations between nondecreasing mode 
and nonincreasing mode is bounded by a fixed constant). W or R 
(but not both) can also be equipped with an unrestricted pushdown 
stack. 

2. One unrestricted queue that can be used to send messages from W 
(the “writer”) to R (the “reader”). There is no bound on the length 
of the queue. When R tries to read from an empty queue, it receives 
an “empty-queue” signal. When this happens, R can continue doing 
other computation and can access the queue at a later time. 

W and R operate at the same clock rate, i.e., each transition (instruc- 
tion) takes one time unit. There is no central control. Note that since 
M is nondeterministic there are, in general, many computation paths 
starting from a given initial configuration. We investigate the decidable 
properties of queue systems. For example, we show that it is decidable 
to determine, given a system M, whether there is some computation 
in which R attempts to read from an empty queue. Other verification 
problems that we show solvable include (binary, forward, and backward) 
reachability, safety, invariance, etc. We also consider some reachability 
questions concerning machines operating in parallel. 



1 Introduction 

It is well-known that, in general, verification problems for infinite-state
systems are undecidable [Esp97]. In fact, even for systems with only two
variables (or counters) that can be incremented or decremented by 1 and tested
for 0, we already know that the halting problem is undecidable [Min61]; hence,
the emptiness, reachability, safety, and other problems are also undecidable.
However, certain restrictions can be placed on the workings of these systems
that make them amenable to analysis. Some models that have been shown to have
decidable properties are: pushdown automata [BEM97, FWW97, Wal96], timed
automata [AD94] (and real-time logics [AH94, ACD93, HNSY94]), and various
approximations on multicounter machines [CJ98, BW94].
S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 145 ff., 2001.

(c) Springer-Verlag Berlin Heidelberg 2001
This paper is a contribution to the reachability and safety analysis of
infinite-state transition systems that can be modeled by a queue system. The
model that we analyze, call it QS, is a system M that consists of the following
components:

1. Two nondeterministic finite-state machines W and R, each augmented with 
finitely many reversal-bounded counters (thus, each counter can be incre- 
mented or decremented by 1 and tested for zero, but the number of alter- 
nations between nondecreasing mode and nonincreasing mode is bounded 
by a fixed constant). W or R (but not both) can also be equipped with an 
unrestricted pushdown stack. 

2. One unrestricted queue Q that can be used to send messages from W (the
“writer”) to R (the “reader”). Thus, from time to time during the computation,
W can nondeterministically write a symbol in Q, and R can nondeterministically
read a symbol from Q. There is no bound on the length of the
queue. When R tries to read from an empty queue, it receives an “empty- 
queue” signal. When this happens, R can continue doing other computation 
and can access the queue at a later time. 

W and R operate at the same clock rate, i.e., each transition (instruction) takes 
one time unit. There is no central control. Note that since M is nondeterminis- 
tic there are, in general, many computation paths starting from a given initial 
configuration. 

We investigate the decidable properties of queue systems. For example, we 
show that it is decidable to determine, given a system M, whether there is some 
computation in which R attempts to read from an empty queue. Other verifi- 
cation problems that we show solvable include (binary, forward, and backward) 
reachability, safety, invariance, etc. We also consider some reachability questions 
concerning machines operating in parallel. 

A different model, where there is only one finite-state machine that controls
the operation of the queue, can simulate a Turing machine (even when the
machine has no counters). A restricted version of this model was recently
studied in [IBS00]. The restriction is that the number of alternations between
non-reading phase and non-writing phase is bounded by a constant. A non-reading
(non-writing) phase is a period consisting of writing (reading) and no-changes,
i.e., the queue is idle. For such systems, reachability and safety are
decidable. Decidable properties of other varieties of systems with queues have
also been studied (see, e.g., [BH99]).

The paper has four sections, in addition to this section. Section 2 recalls the 
definition of pushdown automata with reversal-bounded counters (first
introduced in [Iba78]) and cites some results we use in the paper. Section 3 presents
the main results. Section 4 considers some reachability questions concerning ma- 
chines operating in parallel. Section 5 is a brief conclusion. 

2 Pushdown Automata with Reversal-Bounded Counters 

A pushdown automaton with reversal-bounded counters (PCA) [Iba78] is a
nondeterministic one-way pushdown automaton augmented with k “reversal-bounded”
counters (for some k). The pushdown stack is unrestricted. Without loss of
generality, we assume that the counters can only store nonnegative integers,
since the finite-state control can remember the signs of the numbers. Though
not necessary (since it is one-way), we assume, for convenience, that the
one-way read-only input to the PCA has left and right delimiters. A PCA without
a pushdown stack is called a CA. PCAs, even CAs, are quite powerful. They can
recognize rather complex languages. Decidability/complexity results concerning
PCAs (CAs) have been obtained in [Iba78, GI81]. Some of the results were used
recently to show the decidability/complexity of some decision problems
(containment, equivalence, disjointness, etc.) for database queries with linear
constraints [IS99, ISB00].

There are several papers that investigate verification problems for pushdown
automata (e.g., [BEM97, FWW97, Wal96]), counter machines under various
restrictions (e.g., [CJ98, FS00, ISDBK00]), and pushdown automata with
restricted counters (e.g., [BER95, BH96, IBS00]).

A fundamental result in [Iba78] is the following:



Theorem 1. The emptiness problem for PCAs (i.e., given a PCA M, is the
language, L(M), accepted by M empty?) is decidable.



Remark 1: It has been shown in [GI81] that the emptiness problem for CAs is
decidable in n^{ckr} time for some constant c, where n is the size of the
machine, k is the number of counters, and r is the reversal-bound on each
counter. We believe that a similar bound could be obtained for the case of
PCAs. We will see that the decision questions (reachability, safety, etc.)
investigated in this paper reduce to the emptiness problem.

PCAs can be generalized to have multiple input tapes (one head/tape). Thus,
a k-tape PCA M accepts a relation L(M) of k-tuples of strings. A 1-tape PCA
will simply be called a PCA. A k-tape CA is a k-tape PCA without a stack.

Corollary 1. The emptiness problem for multitape PCAs (hence, also multitape 
CAs) is decidable. 

Proof. We sketch the proof for the case k = 2. Let M be a 2-tape PCA. We may
assume without loss of generality that the two tapes of M use disjoint input
alphabets. We construct a PCA M' such that L(M') is empty if and only if
L(M) is empty. The idea of the construction is as follows: If (x_1, x_2) is an
input to M, then the input to M' is a string x which is some interlacing of the
symbols in x_1 and x_2. Thus x with the symbols in x_1 (x_2) deleted reduces to
x_2 (x_1). Clearly M' can simulate the actions of the two input heads of M on
input x. □
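
The interlacing idea in this proof can be illustrated with a small sketch. It is not part of the construction itself: the function names `interlacings` and `project` are hypothetical, and strings over two disjoint alphabets stand in for the two input tapes. The point is only that deleting the symbols of one alphabet from any interlacing recovers the other tape's content.

```python
from itertools import combinations

def interlacings(x1, x2):
    """Yield every interlacing of x1 and x2 that keeps each string's own order."""
    n, m = len(x1), len(x2)
    # Choose which positions of the merged string carry symbols of x1.
    for positions in combinations(range(n + m), n):
        chosen = set(positions)
        it1, it2 = iter(x1), iter(x2)
        yield "".join(next(it1) if i in chosen else next(it2)
                      for i in range(n + m))

def project(x, alphabet):
    """Delete from x every symbol that is not in `alphabet`."""
    return "".join(c for c in x if c in alphabet)

# Tape 1 uses {a, b}, tape 2 uses {0, 1}: the alphabets are disjoint.
for x in interlacings("ab", "01"):
    assert project(x, "ab") == "ab"
    assert project(x, "01") == "01"
```

Because the alphabets are disjoint, a one-tape machine reading x can always tell which of the two simulated heads a symbol belongs to.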



3 Main Results 

We first look at a queue system with no pushdown stack. Let M be a QS with
writer W, reader R, and queue Q. Both W and R can use instructions of the
following form:
q : x := x + 1 then goto p

q : x := x - 1 then goto p

q : if x # 0 then goto p_1 else goto p_2

q : goto p

q : goto p_1 or goto p_2

where x represents a counter, p, q, ... are states, and # is <, =, or >. We assume
the states are labeled {1, 2, ...}. Without loss of generality (since the states can
remember the sign), we assume that the counters can only take on nonnegative
values. Hence, we can assume that # is = (i.e., testing is only for checking
whether a counter is zero). Note that the nondeterministic “goto p_1 or goto p_2”
instruction is sufficient to simulate other types of nondeterminism. In addition, 
W can use instructions of the form: 

q : write(a) then goto p 

where a is a symbol in the queue alphabet. R, on the other hand, can use
instructions of the form:

q : if queue is empty then goto p else read/delete symbol and do T 

where T is of the form:

if symbol = a_1 then goto p_1 else
if symbol = a_2 then goto p_2 else
...
if symbol = a_n then goto p_n

where a_1, ..., a_n are the symbols in the queue alphabet, and “delete” means
remove the symbol from the queue.

Each counter in W and R is reversal-bounded in the sense that the number 
of times the counter changes mode from nondecreasing to nonincreasing and 
vice-versa is bounded by a constant, independent of the computation. So, for 
example, 

01123334555432110011233 

is 2-reversal. We assume that each instruction takes one time unit and that the 
states of W (R) are labeled 1,2, .... 
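
As a quick check of the reversal-bound definition, the following sketch counts the mode changes in a trace of counter values. The function name `count_reversals` is hypothetical, and the trace is the example sequence given above, read digit by digit.

```python
def count_reversals(values):
    """Count switches between nondecreasing and nonincreasing mode."""
    reversals = 0
    direction = 0  # +1 while rising, -1 while falling, 0 before any change
    for prev, cur in zip(values, values[1:]):
        if cur == prev:
            continue  # a repeated value does not change the mode
        step = 1 if cur > prev else -1
        if direction != 0 and step != direction:
            reversals += 1
        direction = step
    return reversals

trace = [int(d) for d in "01123334555432110011233"]
assert count_reversals(trace) == 2
```

The counter rises to 5, falls back to 0 (first reversal), and then rises again (second reversal), which agrees with the text.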

A configuration of M is a tuple α = (q, X, p, Y, w), where:

1. q is the state of R and X is a tuple of integer values of the counters of R.

2. p is the state of W, Y is a tuple of integer values of the counters of W, and
w is the content (string of symbols) of Q.

Thus, a configuration represents the “total state” of the system at any given
time. Note that α can be represented as a string where the components of the
tuple are separated by markers and the states and counter values are written in
unary.
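
One possible string encoding of a configuration is sketched below. The marker symbol `#`, the unary digit `1`, and the function names are assumptions made for illustration; the text only requires that components be separated by markers and that states and counter values be written in unary. The sketch also assumes that the queue alphabet contains neither `#` nor `1`.

```python
def unary(n):
    """Write a nonnegative integer in unary."""
    return "1" * n

def encode_configuration(q, X, p, Y, w, sep="#"):
    """Encode (q, X, p, Y, w) as one string with marker-separated fields."""
    fields = [unary(q)] + [unary(x) for x in X]
    fields += [unary(p)] + [unary(y) for y in Y]
    fields.append(w)  # queue content, assumed free of `sep`
    return sep.join(fields)

def decode_configuration(s, num_r_counters, num_w_counters, sep="#"):
    """Invert encode_configuration, given the number of counters of R and W."""
    fields = s.split(sep)
    q = len(fields[0])
    X = tuple(len(f) for f in fields[1:1 + num_r_counters])
    p = len(fields[1 + num_r_counters])
    start = 2 + num_r_counters
    Y = tuple(len(f) for f in fields[start:start + num_w_counters])
    w = fields[start + num_w_counters]
    return q, X, p, Y, w

encoded = encode_configuration(2, (0, 3), 1, (1,), "ab")
assert encoded == "11##111#1#1#ab"
assert decode_configuration(encoded, 2, 1) == (2, (0, 3), 1, (1,), "ab")
```

Unary notation matters here because a machine with counters can load a unary field into a counter by incrementing once per input symbol.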
Let α = (q, X, p, Y, w) and β = (q', X', p', Y', w') be two configurations.
We are interested in being able to check if β is reachable from α in 0 or more
transitions. During the computation, when R attempts to access an empty queue,
it receives an “empty-queue” signal. When this happens, R can continue doing
other computation and can access the queue at a later time. We say that M is
non-blocking with respect to a configuration α if M does not access an empty
queue in any computation (i.e., sequence of moves) when started in configuration
α. M is non-blocking if it is non-blocking with respect to any configuration α.
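
To make the notion of blocking concrete, here is a minimal bounded search over a simplified queue system. It is only an illustration under several assumptions: W and R are plain finite-state machines (no counters or stack), both take one step per tick, a read in a tick sees the queue as it was before that tick's write, and the dictionary-based instruction format is invented for this sketch. A bounded search is not a decision procedure; the decidability results in the text rely on a reduction to an emptiness problem instead.

```python
def can_block(writer, reader, w0, r0, max_steps):
    """Return True if R can attempt to read an empty queue within max_steps ticks.

    writer[state] is a list of moves ("write", symbol, next_state) or
    ("goto", "", next_state); reader[state] is a list of moves
    ("read", {symbol: next_state}) or ("goto", next_state).
    """
    frontier = {(w0, r0, "")}
    for _ in range(max_steps):
        next_frontier = set()
        for w_state, r_state, queue in frontier:
            for r_move in reader[r_state]:
                if r_move[0] == "read":
                    if queue == "":
                        return True  # R accessed an empty queue
                    r_next, rest = r_move[1][queue[0]], queue[1:]
                else:
                    r_next, rest = r_move[1], queue
                for kind, symbol, w_next in writer[w_state]:
                    new_queue = rest + symbol if kind == "write" else rest
                    next_frontier.add((w_next, r_next, new_queue))
        frontier = next_frontier
    return False

# W writes one symbol and then idles.
writer = {0: [("write", "a", 1)], 1: [("goto", "", 1)]}
# A patient reader waits one tick before reading; an eager one reads at once.
patient = {0: [("goto", 1)], 1: [("read", {"a": 2})], 2: [("goto", 2)]}
eager = {0: [("read", {"a": 1})], 1: [("goto", 1)]}

assert can_block(writer, patient, 0, 0, 10) is False
assert can_block(writer, eager, 0, 0, 10) is True
```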

Theorem 2. The following problems are decidable: 

1. Given a QS M and a configuration α, is M non-blocking with respect to α?

2. Given a QS M, is it non-blocking?

Proof. We only prove 2), the proof of 1) being similar. We show that given M,
we can effectively construct a 2-tape CA M' such that L(M') is empty if and
only if M is non-blocking. Since the emptiness problem for multitape CAs is
decidable (by Corollary 1), the result follows.

We describe the operation of M'. Let M' be given a pair of configurations
(α, β) represented by ((q, X, p, Y, w), (q', X', p', Y', ε)), where
(q, X, p, Y, w) is on tape1 and (q', X', p', Y', ε) is on tape2. (Note that ε
denotes an empty queue.) M' first reads q, X, p, Y from tape1 and q', X', p', Y'
from tape2, and records these values. (M' uses auxiliary counters to record
X, Y, X', Y'.) At this point, the tape1 head is at the beginning of w and the
tape2 head is at ε. Then M' simulates the computation of R starting in state q
and counter values X. M' uses a counter C_R to keep track of the running time
of R. Immediately after R has read the last symbol of w on tape1, M' suspends
the simulation (records the current values) and begins the simulation of W
starting in state p and counter values Y. M' also uses a counter C_W to keep
track of the running time of W. Immediately after W has written the first
symbol, say a, on the queue (M' does not actually write a but records this
symbol in the state), M' suspends the simulation of W and resumes the
simulation of R until R reads a (which was recorded in the state). M' then
resumes the simulation of W. The process of switching the simulations between
W and R continues until M' “guesses” that W has written a symbol, say b, on a
queue position after which blocking will occur. M' records the time t_W when b
was written (this time is actually the current value of C_W). Then M' continues
the simulation of R until R attempts to read past b. M' records the time t_R
this happens. M' accepts if i) the states and counters of R and W after the
simulations are the same as the corresponding values that M' recorded before
the simulations, and ii) t_R is less than or equal to t_W (this condition
indicates that M tried to access an empty queue). It follows that L(M') is
empty if and only if M is non-blocking. □

Remark 2: Theorem 2 does not hold for a QS model that has a second queue
that can be used to send messages from R to W. It is easy to see that such a QS,
even without counters, can simulate the computation of a Turing machine. A QS
M guesses and verifies a halting sequence of instantaneous descriptions (IDs)
ID_1, ID_2, ..., ID_k of the TM on blank tape. This is done as follows, where
W_1 and R_1 belong to the first machine of M and W_2 and R_2 belong to the
second machine. The first machine operates as follows: W_1 writes ID_1 in the
first queue and then iterates the following: R_1 reads ID_{i+1} from the second
queue and W_1 writes ID_{i+2} on the first queue for i = 1, 2, .... Similarly,
the second machine iterates the following: R_2 reads ID_i from the first queue
and W_2 writes ID_{i+1} in the second queue. It is obvious that the writing and
reading on each queue can be appropriately scheduled so that the TM simulation
can be carried out effectively. R is forced to access an empty queue if and
only if the TM halts.

Remark 3: Similarly, Theorem 2 does not hold when there are two queues Q_1
and Q_2 that can be used to send messages from W to R (even when there are no
counters), since such a system can simulate a TM. The idea is for W to
nondeterministically guess and write a halting sequence of TM IDs
X_1#X_2#X_3#...#X_k in Q_1 and a sequence of IDs Y_1#Y_2#Y_3#...#Y_k in Q_2
such that Y_i is the successor of X_i. R reads Q_1 and Q_2 and checks that
X_{i+1} is the successor of Y_i. R is then forced to access an empty queue if
and only if X_1#Y_1#X_2#Y_2#...#X_k#Y_k is a halting computation of a TM on
blank tape.

W and R can be augmented with a pushdown stack and use instructions of 
the form: 

q : if stack is empty then goto p else read/pop and do T

where T is of the form:

if top = a_1 then goto p_1 else
if top = a_2 then goto p_2 else
...
if top = a_n then goto p_n

where a_1, ..., a_n are the symbols in the pushdown alphabet.

Corollary 2. It is decidable to determine, given a QS M where R or W (but 
not both) has a pushdown stack, whether M is non-blocking. 

Proof. The definition of configuration α will now include the pushdown content.
Thus, α = (q, X, p, Y, w, z), where z is the stack content (with the rightmost
symbol of z the top of the stack). The proof of Theorem 2 generalizes to a QS
M with a pushdown stack. For example, if R has a stack, in the construction
above, M' also writes z on its stack before simulating R and W. The rest of the
construction is similar. M' also checks at the end of the simulations that the
pushdown stack content of R corresponds to the z' portion of tape2. However,
there is a slight complication because the pushdown stack content is in
“reverse”. If tape2 had z'^r (where r denotes reverse) instead of z', then
there is no problem. One can get around this difficulty if the checking of the
stack content with tape2 is done during the simulations instead of waiting
until the end of the simulations. This involves guessing, for each position of
the stack, the last time R rewrites this position, i.e., that the symbol would
not be rewritten further in reaching configuration β. So, e.g., if on stack
position p, the symbol changes are Z_1, ..., Z_k for the entire computation,
then Z_k is the last symbol written on the position, and M' checks after Z_k is
written that the p-th position of the stack word in β is Z_k. M' marks Z_k in
the stack and makes sure that this symbol is never popped or rewritten in the
rest of the computation. We omit the details. □

Remark 4: Corollary 2 does not hold when both R and W have a pushdown
stack. In fact, it is undecidable to determine, given a QS M whose R and W
have no counters but have each a one-turn pushdown stack (i.e., after popping,
the stack can no longer push), whether M is non-blocking. The proof uses the
undecidability of the Post Correspondence Problem (PCP). Let (X, Y) be an
instance of the PCP, where X = (x_1, ..., x_n) and Y = (y_1, ..., y_n), each
x_i, y_i a non-null string over the binary alphabet {0, 1}. Let a_1, ..., a_n,
#, $ be new symbols. We construct a QS M which operates as follows. W uses its
one-turn pushdown stack to nondeterministically write in Q a string of the form:

(Here, w^r denotes the reverse of w.) W writes the string symbol-by-symbol
starting at time t = 0. After writing $, W goes into an infinite loop without 
further writing on the queue. Starting at time t = 1, R reads the symbols on 
the queue pushing on its one-turn pushdown stack. When it sees 

R then checks if 

= Vl-VlVn 

If R finds a discrepancy, it goes into an infinite loop without further accessing 
the queue. If the check is successful, R reads the $ and then attempts to read 
past this symbol. Thus, R will access an empty queue if and only if (X, Y) has
a solution. The result follows since determining if an instance of the PCP has a 
solution is undecidable. 
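
For readers less familiar with the PCP, the sketch below checks a proposed index sequence and performs a length-bounded search. The function names and the sample instance are illustrative assumptions and are not taken from the paper. Since the PCP is undecidable, a bounded search of this kind can confirm that a solution exists but can never establish that none does.

```python
from itertools import product

def is_pcp_solution(X, Y, indices):
    """Check that a nonempty index sequence spells the same word in X and in Y."""
    if not indices:
        return False
    top = "".join(X[i] for i in indices)
    bottom = "".join(Y[i] for i in indices)
    return top == bottom

def search_pcp(X, Y, max_len):
    """Look for a solution with at most max_len indices; return None if none is found."""
    for length in range(1, max_len + 1):
        for indices in product(range(len(X)), repeat=length):
            if is_pcp_solution(X, Y, indices):
                return indices
    return None

# A small instance over {0, 1}; indices are 0-based.
X = ("1", "10111", "10")
Y = ("111", "10", "0")
assert is_pcp_solution(X, Y, (1, 0, 0, 2))  # both sides spell 101111110
```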

Remark 5: It is also undecidable to determine, given a QS M where R and
W have no reversal-bounded counters but have each an unrestricted counter,
whether M is non-blocking. We sketch the proof. Consider deterministic one-way
finite-state acceptors augmented with one unrestricted counter. It is undecidable
to determine, given two such acceptors, A_1 and A_2, whether they accept disjoint
languages. In fact, we can assume that A_1 and A_2 are real-time (i.e., the input
head moves right at every step) fiTHTO] (see also We construct a QS 

M which operates as follows. Let # be a new symbol. Starting at time t =
0, W guesses and writes a string w (symbol-by-symbol) on the queue, while
simultaneously checking if w is accepted by A_1. If w is accepted (not accepted)
by A_1, W writes # ($) and goes into an infinite loop without further writing
on the queue. If A_1 gets “stuck” (i.e., there is no next move) at some symbol
that W has written, W writes $ and goes into an infinite loop without further
writing on the queue. Thus the queue content will be a string of the form w#
or of the form w$. R starts reading the queue at time t = 1 and simulates A_2
on the symbols it reads. If A_2 accepts w and the symbol to the right of w is #,
then R knows that w is accepted by both A_1 and A_2. Then R reads past #.
Thus, R will access an empty queue. If w is not accepted by A_2 and the symbol
to its right is # or $, or if A_2 gets stuck while R is reading the queue, then R
goes into an infinite loop without further accessing the queue. It follows that
M is non-blocking if and only if A_1 and A_2 do not accept a common string,
which is undecidable.

Next, we look at reachability. If M is a QS, let Reach(M) = set of all pairs
of configurations (α, β) such that α can reach configuration β in 0 or more
transitions.

Corollary 3. We can effectively construct, given a non-blocking QS M where
R or W (but not both) has a pushdown stack, a 2-tape PCA M' accepting
Reach(M).

Proof. The construction of M' is similar to the one in Theorem 2 and Corollary
2. M' need only check that C_R = C_W at the end of the simulation, knowing that
during the simulation, R does not see the empty-queue marker. □

We are not able to prove the above corollary when M is unrestricted (i.e., 
may not be non-blocking) because for the simulation to proceed properly, we 
need to be able to tell whether or not the queue is empty every time R accesses 
the queue. However, we can prove the result when the QS has no pushdown 
stack. 

Theorem 3. We can effectively construct, given an unrestricted QS M with no
pushdown stack, a 2-tape PCA M' accepting Reach(M).

Proof. The construction of M' involves two cases. The first case is when R reads
only symbols from w and does not read any new symbol written by W during
the computation from α to β. The second case is when R reads past w. M'
begins by guessing which case to simulate. We describe only the operation of M'
for the second case (the first one being similar). As in the proof of Theorem 2,
M' switches simulations between R and W but now uses a pushdown stack to
tell whether R is accessing an empty Q. During the simulation, the stack keeps
track of the difference in the time t_R(j) the j-th symbol of x was read by R and
the time t_W(j) the j-th symbol of x was written by W. This is possible since
the pushdown can be used as an unrestricted counter (i.e., there is no bound
on the reversal). At some point during the simulation, after R has read the k-th
symbol of x (for some k), M' guesses that R has read the last symbol it wants
to read from the queue. M' continues the simulation of R and at some point
guesses that R has reached the R-component of configuration β which M' can
verify (this includes checking that the current state and counter values are equal
to what were recorded before the start of the simulations). Then M' resumes
the simulation of W including checking that the symbols written on the queue
appear in tape2. The simulation continues until at some point M' guesses that
W has reached the W-component of β which M' can verify. M' accepts if and
only if C_R = C_W. □
In the rest of this section, we will consider only unrestricted QSs without a
pushdown stack. The next result concerns nonsafety. 

Theorem 4. It is decidable to determine, given a QS M and two sets of
configurations S (start set) and B (unsafe set) accepted by CAs, whether there
is a configuration in S that can reach a configuration in B. Thus nonsafety is
decidable.

Proof. From Theorem 3, let M' be a 2-tape PCA accepting Reach(M). Let
M_S and M_B be the CAs accepting S and B, respectively. By using additional
counters, we can modify M' to a 2-tape PCA M'' which also checks that α (β)
on tape1 (tape2) is accepted by M_S (M_B). Then M is unsafe if and only if
L(M'') is nonempty, which is decidable by Corollary 1. □

Corollary 4. It is decidable to determine, given a QS M and two sets of
configurations S (start set) and G (safe set) accepted by a CA and a
deterministic CA respectively, whether every configuration in S can only reach
configurations in G. Thus invariance is decidable.

Proof. It can be shown that if G is accepted by a deterministic CA M_G, then
we can effectively construct a deterministic CA accepting the set of unsafe
configurations B = the complement of G. (Note that this is not true if M_G is
not deterministic.) The result follows from the above theorem. □

Next, we show that forward reachability is computable. 

Theorem 5. We can effectively construct, given a QS M and a set of
configurations S accepted by a CA, a PCA accepting post*(M, S) = the set of all
configurations reachable from configurations in S in 0 or more transitions.

Proof. Let M_S be a CA accepting S. As in Theorem 4, we can construct a 2-tape
PCA M' accepting the set of all pairs of configurations (α, β) in Reach(M) such
that α is accepted by M_S. We can then construct from M' a PCA M'' which
deletes the first tape, and L(M'') = post*(M, S). □

Similarly, for backward reachability we have:

Corollary 5. We can effectively construct, given a QS M and a set of
configurations S accepted by a CA, a PCA accepting pre*(M, S) = the set of all
configurations that can reach configurations in S in 0 or more transitions.
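
The sets post*, pre*, and the nonsafety question have a familiar finite-state analogue, sketched below for an explicitly given transition graph. This is only an analogy: queue systems have infinitely many configurations, so the results above work with automata that accept these sets rather than with explicit enumeration. The function names and the toy graph are assumptions.

```python
from collections import deque

def post_star(transitions, start):
    """All configurations reachable from `start` in 0 or more transitions."""
    seen = set(start)
    work = deque(seen)
    while work:
        c = work.popleft()
        for d in transitions.get(c, ()):
            if d not in seen:
                seen.add(d)
                work.append(d)
    return seen

def pre_star(transitions, targets):
    """All configurations that can reach `targets` in 0 or more transitions."""
    reverse = {}
    for c, successors in transitions.items():
        for d in successors:
            reverse.setdefault(d, set()).add(c)
    return post_star(reverse, targets)

def is_unsafe(transitions, S, B):
    """Nonsafety: some configuration in S can reach a configuration in B."""
    return not post_star(transitions, S).isdisjoint(B)

graph = {"a": {"b"}, "b": {"c"}, "d": {"a"}}
assert post_star(graph, {"a"}) == {"a", "b", "c"}
assert pre_star(graph, {"b"}) == {"a", "b", "d"}
assert is_unsafe(graph, {"a"}, {"c"})
```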

We can define a QS acceptor M by giving R and W one-way input tapes. We 
say that a pair of strings (x, y) is accepted by M if on this pair of input strings 
and starting with all counters of R and W zero, M reaches a configuration when 
R and W are both in accepting states. We can show that the set of all pairs of 
strings accepted by a QS acceptor can be accepted by a 2-tape PCA. 

If, however, the inputs to R and W are common, i.e., there is only one input 
tape from which R and W can read, the emptiness problem is undecidable (even 
when R and W are finite-state). The proof uses the idea in Remark 3. 
4 Reachability in Parallel Machines

The technique of using the reversal-bounded counters to record and compare 
various integers (like the running times of the machines) in the proofs in the 
previous section can be used to decide some reachability questions concerning 
machines operating in parallel. The motivation for studying reachability is the 
following: 

Suppose we are given a problem with input domain X, n nondeterministic
machines M_1, ..., M_n, and an input x in X which can be partitioned into n
components x_1, ..., x_n. Each machine M_i is to “work” on x_i to obtain a
partial solution y_i. The solution to x can be derived from y_1, ..., y_n. We
want to know if there is a computation (note that since the machines are
nondeterministic, there may be several such computations) in which each M_i on
input x_i outputs y_i and their running times t_i satisfy a given linear
relation (definable by a Presburger formula). An example of a relation is for
each t_i to be within 5% of the average of the running times (i.e., the load is
approximately balanced among the M_i's), or for the t_i's to satisfy some
precedence constraints. Note that the t_i's need not be optimal as long as they
satisfy the given linear relation. A stronger requirement is to find the
optimal running time t_i of each M_i and determine if t_1, ..., t_n
satisfy the given linear relation.
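
The 5% load-balancing example can be written with integer arithmetic only, which suggests why it is a linear relation for a fixed number n of machines. In the sketch below (the function name is hypothetical), the condition |t_i - s/n| <= 0.05 s/n, with s the sum of the running times, is multiplied through by 20n to give 20 |n t_i - s| <= s.

```python
def within_five_percent(times):
    """True if every running time is within 5% of the average running time."""
    n, total = len(times), sum(times)
    # |t - total/n| <= 0.05 * total/n  is equivalent to  20*|n*t - total| <= total
    return all(20 * abs(n * t - total) <= total for t in times)

assert within_five_percent([100, 104, 96])
assert not within_five_percent([100, 110, 90])
```

For fixed n, the products n*t and 20*(...) are multiplications by constants, and the absolute value can be split into two inequalities, so the condition uses only addition and comparison of the t_i.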

The questions above are unsolvable in general, even when the machines work 
independently (no sharing of data/variables). However, we are able to show that 
they are solvable for reversal-bounded multicounter machines with a pushdown 
stack. We illustrate below. 

Let M_1 and M_2 be nondeterministic reversal-bounded multicounter machines
with a pushdown stack but no input tape operating independently in parallel.
Thus these machines are PCAs without input tape. Call them PCMs. For i =
1, 2, denote by α_i a configuration (q_i, X_i, w_i) of M_i (state, counter
values, stack content). Let L(m, n) be a linear relation definable by a
Presburger formula. Define Reach(M_1, M_2, L) to be the set of all 4-tuples
(α_1, β_1, α_2, β_2) such that for some t_1, t_2, M_i when started in
configuration α_i can reach configuration β_i at time t_i (i = 1, 2), and t_1
and t_2 satisfy L, i.e., L(t_1, t_2) is true. (Thus, e.g., if the linear
relation is t_1 = t_2, then we want to determine if M_1 when started in
configuration α_1 reaches β_1 at the same time that M_2 when started in α_2
reaches β_2.)

Theorem 6. We can effectively construct, given two PCMs M_1 and M_2 and a
Presburger relation L, a 4-tape PCA M accepting Reach(M_1, M_2, L).

Proof. M, when given α_1, β_1, α_2, β_2 on its 4 tapes, first simulates the
computation of M_1 to check that α_1 can reach β_1, recording the running time
t_1 in a counter. M then simulates M_2, recording the running time t_2 in
another counter. Then M checks that t_1 and t_2 satisfy the given linear
relation (which can be verified since Presburger formulas can be evaluated by
nondeterministic reversal-bounded multicounter machines [Iba78]). □
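
The shape of this argument (simulate one machine, record its running time, simulate the other, then test the relation) can be mimicked for small explicit transition graphs. The sketch below is a bounded, finite-state analogue under stated assumptions: machines are dictionaries of successor sets, times are explored only up to `max_time`, and the function names are hypothetical. It does not implement the PCA construction.

```python
def reach_times(transitions, alpha, beta, max_time):
    """Times t <= max_time at which beta can be reached from alpha in exactly t steps."""
    times = set()
    current = {alpha}
    for t in range(max_time + 1):
        if beta in current:
            times.add(t)
        current = {d for c in current for d in transitions.get(c, ())}
    return times

def reach_with_relation(m1, a1, b1, m2, a2, b2, relation, max_time):
    """Is there a pair of running times (t1, t2) within the bound that satisfies `relation`?"""
    t1s = reach_times(m1, a1, b1, max_time)
    t2s = reach_times(m2, a2, b2, max_time)
    return any(relation(t1, t2) for t1 in t1s for t2 in t2s)

m1 = {0: {1}, 1: {0}}            # returns to state 0 every 2 steps
m2 = {0: {1}, 1: {2}, 2: {0}}    # returns to state 0 every 3 steps
same_positive_time = lambda t1, t2: t1 == t2 and t1 > 0

assert reach_times(m1, 0, 0, 6) == {0, 2, 4, 6}
assert reach_with_relation(m1, 0, 0, m2, 0, 0, same_positive_time, 6)
assert not reach_with_relation(m1, 0, 0, m2, 0, 0, same_positive_time, 5)
```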

We can allow the machines M_1 and M_2 to share common read-only data,
i.e., each machine has a one-way read-only input head. A configuration α_i will
now be a 4-tuple (q_i, X_i, w_i, h_i), where h_i is the position of the input
head on the common input x.

Suppose only one of M_1 and M_2 has a pushdown stack. The machine M to
decide reachability is now a 5-tape PCA. The input 5-tuple is
(α_1, β_1, α_2, β_2, x). M simulates M_1 and M_2 in parallel on input x,
recording their running times. If one of the machines, e.g., M_1, advances its
input head to the next input symbol, but M_2 has not yet read the current input
symbol, M does not advance its input head and “suspends” the simulation of M_1
until M_2 has read the current symbol or M guesses that M_2 will not be reading
further on the input to reach its target configuration. We omit the details.
Thus we have:

Corollary 6. Reachability in two PCMs (only one of which has a pushdown
stack) with a shared read-only input can be accepted by a 5-tape PCA.

The above corollary is not true if both M_1 and M_2 have a one-turn stack (or
an unrestricted counter), since reachability is undecidable, even if the machines
have no reversal-bounded counters and the linear relation is t_1 = t_2. The proof is
similar to the one given in Remark 4 (Remark 5). Similarly, if M_1 and M_2 share
two common input tapes (i.e., each machine has two input heads for accessing the
common tapes), reachability is undecidable, even if the machines are finite-state
and the linear relation is t_1 = t_2. The proof is similar to that of Remark 3.

Note that the results above generalize to any number, k, of machines M_i
(i = 1, ..., k) operating in parallel.

5 Conclusions 

We introduced a new model of a queue system and analyzed the solvability of ver- 
ification problems such as (binary, forward, and backward) reachability, safety, 
and invariance. We also looked at reachability questions concerning machines 
operating in parallel. In the future we would like to investigate the decidability 
of liveness properties for these systems. We would also like to investigate the 
complexity of the verification procedures described in this paper. Finally, we 
mention an interesting open problem: Can we prove Theorem 3 when one of R 
or W (but not both) has a pushdown stack? 
Abstract. We describe a general automata-theoretic approach for analyzing
the verification problems (binary reachability, safety, etc.) of discrete
timed automata augmented with various data structures. We give
examples of such data structures and exhibit some new properties of 
discrete timed automata that can be verified. We also briefly consider 
reachability in discrete timed automata operating in parallel. 



1 Introduction 

Ever since the introduction of the model of a timed automaton there 

have been many studies that extend the expressive power of the model (e.g. 

For instance considers models 

of hybrid systems of finite automata supplied with (unbounded) discrete data 
structures and continuous variables and obtains decidability results for several 
classes of systems with control variables and observation variables. [CJ99, CJ98]
shows that the binary reachability of timed automata is expressible in the
additive theory of the reals. [DIBKS00] characterizes the binary reachability of
discrete timed automata (i.e., timed automata with integer-valued clocks)
augmented with a pushdown stack, while [IDS00] looks at queue-connected discrete
timed automata.

In this paper, we extend the ideas in [DIBKS00, IDS00] and describe a general
automata-theoretic approach for analyzing the verification problems of discrete 
timed automata augmented with various data structures. Formally, let C be a 
class of nondeterministic machines with reversal-bounded counters (i.e., each 
counter can be incremented or decremented by 1 and tested for zero, but the 
number of alternations between nondecreasing mode and nonincreasing mode 
is bounded by a constant, independent of the computation) and possibly other 
data structures, e.g., a pushdown stack, a queue, a read-write worktape, etc. Let 
A be a discrete timed automaton and M be a machine in C. Denote by A ⊕ M
the combined automaton, i.e., A augmented with M (in some precise sense to
be defined). We show that if C has a decidable emptiness problem, then the
(binary, forward, backward) reachability, safety, and invariance for A ⊕ M are
also solvable. We give examples of such C’s and exhibit some new properties of 
discrete timed automata that can be verified:

1. For example, let A be a discrete timed automaton with k clocks. For a
given computation of A, let r_i be the number of times clock i resets,
i = 1, ..., k. Suppose we are interested in computations of A in which the
r_i’s satisfy a Presburger formula f, i.e., we are interested in the set Q of
pairs of configurations (α, β) such that α can reach β in a computation in
which the clock resets satisfy f. (A configuration of A is a pair (q, U),
where q is a state and U is the set of clock values.) We can show that Q is
Presburger. One can also put other constraints, like introducing a parameter
t_i for each clock i, and consider computations where the first time i resets
to zero is before (or after) time t_i. Then Q(t_1, ..., t_k) is Presburger.

2. As another example, suppose we are interested in the set S of pairs of
configurations (α, β) of a discrete timed automaton A such that there is a
computation path (i.e., sequence of states) from α to β that satisfies a property
that can be verified by a machine in a class C. If C has a decidable emptiness
problem, then S is effectively computable. For example, suppose that the 
property is for the path to contain three non-overlapping subpaths (i.e., seg- 
ments of computation) which go through the same sequence of states, and 
the length of the subpath is no less than 1/5 of the length of the entire path. 
We can show that S is computable. 

The constraints in 1 and 2 can be combined; thus, we can show that the set of 
pairs of configurations that are in both Q and S is computable. 

3. We can equip the discrete timed automaton with one-way write-only tapes 
which the automaton can use to record certain information about the com- 
putation of the system (and perhaps even require that the strings appearing 
in these tapes satisfy some properties). Such systems can effectively be an- 
alyzed. 

Finally, we briefly look at reachability in machines (i.e., A_1 ⊕ M_1 and A_2 ⊕ M_2)
operating in parallel. 



2 Combining Discrete Timed Automata with Other 
Machines 

A timed automaton is a finite-state machine augmented with finitely
many real-valued clocks. All the clocks progress synchronously with rate 1,
except that a clock can be reset to 0 at some transition. Here, we only consider
integer-valued clocks. A clock constraint is a Boolean combination of atomic clock
constraints of the following form: x # c, x − y # c, where # denotes ≤, ≥, <, >, or
=, c is an integer, and x, y are integer-valued clocks. Let L_X be the set of all
clock constraints on clocks X. Let Z be the set of integers and N the set of
nonnegative integers. Formally, a discrete timed automaton A is a tuple (S, X, E) where

1. S is a finite set of (control) states,

2. X is a finite set of clocks with values in N, and

3. E ⊆ S × 2^X × L_X × S is a finite set of edges or transitions.

Each edge (s, λ, l, s') in E denotes a transition from state s to state s' with
enabling condition l ∈ L_X and a set of clock resets λ ⊆ X. Note that λ may
be empty. The meaning of a one-step transition along an edge (s, λ, l, s') is as
follows:

— The state changes from s to s'.

— Each clock changes. If there are no clock resets on the edge, i.e., λ = ∅, then
each clock x ∈ X progresses by one time unit. If λ ≠ ∅, then each clock
x ∈ λ is reset to 0 while each x ∉ λ remains unchanged.

— The enabling condition l is satisfied.
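
The one-step semantics above can be sketched in Python as follows. This is a minimal illustration and not code from the paper: the edge encoding, the name step, and the choice to evaluate the enabling condition on the clock values before the update are all assumptions made for the example.

```python
from typing import Callable, Dict, FrozenSet, Optional, Tuple

Clocks = Dict[str, int]
# Hypothetical edge encoding (s, resets, guard, s'); the guard is a Python predicate.
Edge = Tuple[str, FrozenSet[str], Callable[[Clocks], bool], str]


def step(state: str, clocks: Clocks, edge: Edge) -> Optional[Tuple[str, Clocks]]:
    """Take one transition along `edge`, or return None if it is not enabled."""
    src, resets, guard, dst = edge
    # Assumption: the enabling condition is checked on the current clock values.
    if state != src or not guard(clocks):
        return None
    if not resets:
        # No resets on the edge: every clock progresses by one time unit.
        new_clocks = {x: v + 1 for x, v in clocks.items()}
    else:
        # Some resets: reset clocks go to 0, all other clocks stay unchanged.
        new_clocks = {x: (0 if x in resets else v) for x, v in clocks.items()}
    return dst, new_clocks


# Two illustrative edges: a pure time step, and a guarded reset of clock x.
tick: Edge = ("s0", frozenset(), lambda c: True, "s0")
reset_x: Edge = (
    "s0",
    frozenset({"x"}),
    lambda c: c["x"] - c["y"] >= 0 and c["x"] > 1,
    "s1",
)

config = ("s0", {"x": 0, "y": 0})
config = step(*config, tick)   # both clocks advance
config = step(*config, tick)
print(config)
print(step(*config, reset_x))  # x is reset, y keeps its value
```

Note that in this model time only advances on edges without resets, so a reset edge leaves every other clock untouched.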

The notion of a discrete timed automaton defined above is slightly different
from, but easily shown equivalent to, the standard definition of a (discrete) timed
automaton in [AD94] (see [DIBKS00]).

Now consider a class C of acceptors, where each machine M in the class is a
nondeterministic finite automaton augmented with finitely many counters, and
possibly other data structures. Thus, M = <Q, Σ, q_0, F, K, D, δ>, where Q
is the state set, Σ is the input alphabet, q_0 is the start state, F is the set of
accepting states, K is the set of counters, D the other data structures, and δ is
the transition function. In the move δ(q, a, s_1, ..., s_k, loc) = {t_1, ..., t_m},

— q is the state, a is ε or a symbol in Σ, s_i is the status of counter i (i.e., zero
or non-zero), and loc is the “local” portion of the data structure(s) D that
influences (affects) the move. For example, if D is a pushdown stack, then
loc is the top of the stack; if D is a two-way read-write tape, then loc is the
symbol under the read-write head; if D is a queue, then loc is the symbol
in the front of the queue or ε if the queue is empty. Note that D can be a
combination of several data structures (e.g., several stacks and queues).

— t_1, ..., t_m are the choices of moves (note that M is nondeterministic). Each
t_i is of the form (p, d_1, ..., d_k, act), which means increment counter j by d_j
(1, 0, or −1) for each j, perform act on loc, and enter state p. For example, if D is
a pushdown stack, act pops the top symbol and pushes a string (possibly
empty) onto the stack; if D is a two-way read-write tape, act rewrites the
symbol under the read-write head and moves the head one cell to the left or
right, or leaves it on the current cell; if D is a queue, then act deletes loc (if
not ε) from the front of the queue, and possibly adds a symbol to the rear of
the queue.

Note that the counters can only hold nonnegative integers. There is no loss of 
generality since the states can remember the signs. 

The language accepted by M is denoted by L(M). We will only be interested
in C’s with a decidable emptiness problem. This is the problem of deciding, for
a given acceptor M in C, whether L(M) is empty. Since the emptiness problem for
finite automata augmented with two counters is undecidable, we will need
to put some restrictions on the operation of the counters.

Let r be a nonnegative integer. We say that a counter is r-reversal if the
counter changes mode from nondecreasing to nonincreasing and vice-versa at
most r times, independent of the computation. So, for example, a counter whose
values change according to the pattern 0 1 1 2 3 3 3 4 5 5 5 4 3 2 1 1 0 0 1 1 2 3 3
is 2-reversal. When we say that the counters are reversal-bounded, we mean that 
we are given an integer r such that each counter is r-reversal. From now on, we 
will assume that the acceptors in C have reversal-bounded counters. 
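
To make the definition concrete, the following sketch counts the mode changes in a sequence of counter values. The function name count_reversals is hypothetical, and the sketch assumes that the initial mode is fixed by the first strict change of value, which the text does not spell out.

```python
from typing import Sequence


def count_reversals(values: Sequence[int]) -> int:
    """Count switches between nondecreasing and nonincreasing mode."""
    reversals = 0
    mode = 0  # +1 after a strict increase, -1 after a strict decrease, 0 undetermined
    for prev, cur in zip(values, values[1:]):
        if cur == prev:
            continue  # a no-change step is compatible with either mode
        direction = 1 if cur > prev else -1
        if mode != 0 and direction != mode:
            reversals += 1
        mode = direction
    return reversals


# The pattern used as the example in the text.
pattern = [int(ch) for ch in "01123334555432110011233"]
print(count_reversals(pattern))  # prints 2
```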

We can extend the acceptors in C to multitape acceptors by providing them
with multiple one-way read-only input tapes. Thus, a k-tape acceptor now
accepts a k-tuple of words (strings). We call the resulting class of acceptors C(k).
The emptiness problem for C(k) is deciding for a given k-tape acceptor M,
whether it accepts an empty set of k-tuples of strings. We denote C(1) simply
by C. One can easily show the following:

Theorem 1. If the emptiness problem for C is decidable, then the emptiness
problem for C(k) is decidable.

In the rest of the paper, we will assume that C has a decidable emptiness 
problem. In the area of verification, we are mostly interested in the “behavior” of 
machines rather than their language-accepting capabilities. When dealing with 
machines in C without inputs, we shall refer to them simply as machines. Thus, 
when we say “a machine M in C”, we mean that M has no input tape. 

Let A be a discrete timed automaton and M a machine in class C (hence,
M has no input tape). Let A ⊕ M be the machine obtained by augmenting
A with M. So, e.g., if M is a machine with a pushdown stack and
reversal-bounded counters, then A ⊕ M will be a discrete pushdown timed automaton
with reversal-bounded counters. We will describe more precisely how A ⊕ M
operates later. A configuration α of A ⊕ M is a 5-tuple (s, U, q, V, v(D)), where s
and U are the state and clock values of A, and q, V, v(D) are the state, counter
values, and data structure values of M (e.g., if D is a pushdown stack, then
v(D) is the content of the stack; if D is a queue, then v(D) is the content of the
queue). Let Reach(A ⊕ M) be the set of all pairs of configurations (α, β) such
that α can reach β. This set is the binary reachability of A ⊕ M. We assume that
the configurations are represented as strings over some alphabet, where the
components of a configuration are separated by markers and the clock and counter
values are represented in unary. We also assume that each of the following tasks
can be implemented on a machine M' in C: (i) M', when given a configuration
α = (s, U, q, V, v(D)) of A ⊕ M on its input tape, can represent this configuration
in its counters and data structures, i.e., M' can read α and record the states s
and q, store the set of values of U and V in appropriate counters, and store v(D)
in its data structures; (ii) M', when given a configuration α on its input tape,
can check if α represents its current configuration (this task is the converse of
(i)).

In the following, A is a discrete timed automaton and M is a machine in C; 
FCA refers to a nondeterministic finite automaton (acceptor) augmented with 
reversal-bounded counters.

Theorem 2. We can effectively construct a 2-tape acceptor in C(2) accepting
Reach(A ⊕ M). Note that the input to the 2-tape acceptor is a pair of
configurations (α, β), where α (β) is on the first (second) tape.

We sketch the proof of the above theorem in the next section for a particular 
class C. We omit the proofs of the next four theorems in this extended abstract. 

Theorem 3. If I (the initial set) and P (the unsafe set) are two sets of
configurations of A ⊕ M, let BAD be the set of all configurations in I that can reach
configurations in P. If I and P can be accepted by FCAs, then we can effectively
construct an acceptor in C accepting BAD. Hence, nonsafety is decidable with
respect to P.

Since the complement of a language accepted by a deterministic FCA can 
also be accepted by an FCA, we also have:

Theorem 4. If I and P (the safe set) are two sets of configurations of A ⊕ M,
let GOOD be the set of all configurations in I that can only reach configurations
in P. If I can be accepted by an FCA and P can be accepted by a deterministic
FCA, then we can decide whether GOOD = I. Hence, invariance is decidable
with respect to P.

We can show that forward reachability is computable. 

Theorem 5. Let I be a set of configurations accepted by an FCA. We can
effectively construct an acceptor in C accepting post*(A ⊕ M, I) = the set of all
configurations reachable from configurations in I.

Similarly, for backward reachability we have: 

Theorem 6. Let I be a set of configurations accepted by an FCA. We can
effectively construct an acceptor in C accepting pre*(A ⊕ M, I) = the set of all
configurations that can reach configurations in I.

We can equip A ⊕ M with a one-way input tape. In order to do this, we can
simply change the format of the transition edge of A to a 5-tuple (s, λ, l, s', a)
in E, where a denotes an input symbol or ε (the null string). The meaning
of this edge is like before, but now A can read a symbol or a null string at
each transition. We also define a subset of the states of A as accepting states.
Then A ⊕ M becomes an acceptor. Note that A and M will now start on some
prescribed initial configurations (e.g., A is initialized to its start state with all 
clocks zero, M is initialized to its start state with all counters zero and the other 
data structures properly initialized). We can prove: 

Theorem 7. It is decidable to determine, given an acceptor A ⊕ M, whether
A ⊕ M accepts the empty set.

One can extend the A ⊕ M acceptor to have multiple input tapes. Then, as in
Theorem 1, we have:

Corollary 1. It is decidable to determine, given a multitape acceptor A ⊕ M,
whether A ⊕ M accepts the empty set.

We can also equip the multitape A ⊕ M acceptor with one-way output tapes.
But, clearly, these output tapes can also be viewed as input tapes (since writing
can be simulated by reading). Hence, the analysis of a multi-input-tape
multi-output-tape A ⊕ M reduces to the analysis of multi-input-tape A ⊕ M.

3 Examples of C 

We sketch the proof of Theorem 2 for the class C, where each machine is a
nondeterministic machine with a pushdown stack and finitely many
reversal-bounded counters. Call a machine in this class a PCM, and a PCA when it has
an input tape (i.e., it is an acceptor). It is known that the emptiness problem
for PCAs is decidable. Let A be a discrete timed automaton and M be a
PCM. We describe precisely how A ⊕ M operates.

A configuration of the timed automaton A is of the form (s, U), where s is
the state and U is the set of clock values. Now machine M has states, a pushdown
stack, and reversal-bounded counters. A move of M is defined by a transition
function δ. If δ(q, Z, s_1, ..., s_k) = {t_1, ..., t_m}, then

— q is the state, Z is the topmost symbol, and s_i is the status of counter i (i.e.,
zero or non-zero).

— t_1, ..., t_m are the choices of moves (note that M is nondeterministic). Each t_i
is of the form (p, w, d_1, ..., d_k), which means pop Z and push string w (which
is possibly empty) onto the stack, increment counter j by d_j (1, 0, or −1) for
each j, and enter state p.

A configuration of M can be represented by a tuple of the form (q, V, w), where 
q is a state, V is the set of values of the counters, and w is the content of the 
stack with the rightmost symbol the top of the stack. 

A transition of the combined machine A ⊕ M is now a tuple (s, λ, l, s',
ENTER(M, R)), where (s, λ, l, s') is as in a timed automaton. The combined
transition is now carried out in two stages. Like before, A (the timed automaton
component of the combined machine) makes the transition based on (s, λ, l, s'). It
then transfers control to machine M by executing the command ENTER(M, R),
where R is a one-step transition rule: R(s, λ, l, s', q, Z, s_1, ..., s_k) = {t_1, ..., t_m}.
Note that the outcome of this transition (i.e., the right side of the rule) depends
not only on (s, λ, l, s'), but also on the current state, the status of the counters,
and the topmost symbol of the stack. This R is then followed by a sequence of
transitions by M (using the transition function δ). Thus the use of ENTER(M, R) allows
the combined machine to update the configuration of M through a sequence of
M’s transitions. After some amount of computation, M returns control to A
by entering a special state or command RETURN. When this happens, A will
now be in state s'. Thus the computation of A ⊕ M is like that of a timed
automaton, except that between each transition of A, the system calls M to do some
computation.

A configuration of the system is a tuple of the form α = (s, U, q, V, w). Thus,
a configuration is one after the execution of a (possibly empty) sequence of
(ENTER, RETURN) commands. Note that a configuration can be represented
as a string where the clock values U and counter values V are represented in
unary and the components of the tuple are separated by markers.

As defined earlier, the binary reachability is Reach(A ⊕ M) = the set of all
pairs of configurations (α, β), where α can reach β. We will show that
Reach(A ⊕ M) can be accepted by a 2-tape PCA. Note that the input to the acceptor
is a pair of strings (α, β), where α (β) is on the first (second) tape.

First we note that we can view the clocks in a discrete timed automaton A 
as counters, which we shall also refer to as clock-counters. In a reversal-bounded 
multicounter machine, only standard tests (comparing a counter against 0) and 
standard assignments (increment or decrement a counter by 1, or simply no- 
change) are allowed. But clock-counters in A do not have standard tests nor 
standard assignments. The reasons are as follows. A clock constraint allows
comparison between two clocks like x_2 − x_1 > 7. Note that using only standard tests
we cannot directly compare the difference of two clock-counter values against an
integer like 7 by computing x_2 − x_1 in another counter, since each time this
computation is done, it will cause at least one counter reversal, and the number of such
tests during a computation can be unbounded. The clock progress x := x + 1
is standard, but the clock reset x := 0 is not. Since there is no bound on the
number of clock resets, clock-counters may not be reversal-bounded (each reset 
causes a counter reversal). 

We first prove an intermediate result. Define a semi-PCA as a PCA which, in
addition to a stack and reversal-bounded counters, has clock-counters that use 
nonstandard tests and assignments as described in the preceding paragraph. 

Lemma 1. We can effectively construct, given a discrete timed automaton A
and a PCM M, a 2-tape semi-PCA B accepting Reach(A ⊕ M).

Proof. We describe the construction of the 2-tape semi-PCA B. Given a pair of
configurations (α, β) on its two input tapes, B first copies α into its counters and
stack (these include the clock-counters). Then B simulates the (“alternating”
mode of) computation of A ⊕ M starting from configuration α as described
above. It is clear that B can do this. After some time, B guesses that it has
reached the configuration β. It then checks that the values of the counters and
stack match those on the second input tape. B accepts if the check succeeds.
However, there is a slight complication because the pushdown stack content is
in “reverse”. If the stack content on the second tape is written in reverse, there
is no problem. Otherwise, one can get around this difficulty if the comparison of
the stack content with the second tape is done during the simulation instead of waiting
until the end of the simulation. This involves guessing, for each position of the
stack, the last time M rewrites this position, i.e., that the symbol would not be
rewritten further in reaching configuration β. We omit the details. □

The next lemma converts the 2-tape semi-PCA to a 2-tape PCA. The proof
uses a technique in [DIBKS00] (see also [IDS00]).

Lemma 2. We can effectively construct from the 2-tape semi-PCA B, a 2-tape
PCA C equivalent to B.

Proof. The 2-tape PCA C operates like B, but the simulation of A ⊕ M differs in
the way A is simulated. Let A have clock-counters x_1, ..., x_k. Let m be one plus
the maximal absolute value of all the integer constants that appear in the tests
(i.e., the clock constraints on the edges of A in the form of Boolean combinations
of x_i # c, x_i − x_j # c with c an integer). Denote the finite set {−m, ..., 0, ..., m}
by [m]. Define two finite tables with entries a_ij and b_i for 1 ≤ i, j ≤ k. Each
entry can be regarded as a finite state variable with states in [m]. Intuitively,
a_ij is used to record the difference between the two clock values x_i and x_j, and b_i
is used to record the clock value of x_i. During the computation of A, when the
difference x_i − x_j (or the value x_i) goes above m or below −m, a_ij (or b_i) stays
the same as m or −m.

C simulates A exactly except that it uses a_ij # c for the test x_i − x_j # c and b_i # c
for the test x_i # c, with −m < c < m. One can prove (by induction) that doing
this is valid: each time after C updates the entries by executing a transition,
x_i − x_j # c iff a_ij # c, and x_i # c iff b_i # c, for all 1 ≤ i, j ≤ k and for each integer
c ∈ [m − 1]. The details for setting the initial values of the entries (a_ij and b_i)
and updating them are given in the full paper.
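
The table idea can be illustrated with a small Python sketch. Since the update rules are deferred to the full paper, the rules in progress and reset below are an assumption (one natural choice that follows from the clock semantics of Section 2), and the names clip, ClockTables, and check are hypothetical. The check function runs the real clocks and the saturated tables side by side and confirms that every test with −m < c < m gives the same answer on both.

```python
import random
from typing import List, Set


def clip(v: int, m: int) -> int:
    """Saturate v into the finite range [-m, m]."""
    return max(-m, min(m, v))


class ClockTables:
    """Finite tables a[i][j] ~ x_i - x_j and b[i] ~ x_i, saturated at +/- m."""

    def __init__(self, clocks: List[int], m: int) -> None:
        k = len(clocks)
        self.m = m
        self.b = [clip(x, m) for x in clocks]
        self.a = [[clip(clocks[i] - clocks[j], m) for j in range(k)] for i in range(k)]

    def progress(self) -> None:
        # All clocks advance together, so every difference is unchanged.
        self.b = [min(v + 1, self.m) for v in self.b]

    def reset(self, resets: Set[int]) -> None:
        k = len(self.b)
        for i in range(k):
            for j in range(k):
                if i in resets and j in resets:
                    self.a[i][j] = 0
                elif i in resets:
                    self.a[i][j] = -self.b[j]  # new difference is 0 - x_j
                elif j in resets:
                    self.a[i][j] = self.b[i]   # new difference is x_i - 0
        for i in resets:
            self.b[i] = 0


def check(k: int = 3, m: int = 4, steps: int = 2000, seed: int = 0) -> bool:
    """Compare table-based tests with tests on the real clock values."""
    rng = random.Random(seed)
    x = [rng.randint(0, 10) for _ in range(k)]
    tables = ClockTables(x, m)
    for _ in range(steps):
        resets = {i for i in range(k) if rng.random() < 0.2}
        if resets:
            x = [0 if i in resets else v for i, v in enumerate(x)]
            tables.reset(resets)
        else:
            x = [v + 1 for v in x]
            tables.progress()
        for i in range(k):
            for c in range(-m + 1, m):
                if (x[i] >= c) != (tables.b[i] >= c):
                    return False
                for j in range(k):
                    if (x[i] - x[j] >= c) != (tables.a[i][j] >= c):
                        return False
    return True


print(check())
```

Only the test x # c with # taken as ≥ is exercised here; the other comparison operators follow in the same way because saturation is monotone and is the identity strictly between −m and m.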

Thus clock-counter comparisons are replaced by finite table look-up and, 
therefore, nonstandard tests are not present in C. Finally, we show how
nonstandard assignments of the form x_i := 0 (clock resets) in machine C can be
avoided. 

Clearly after eliminating the clock comparisons, the clock-counters in C do 
not participate in any tests except: 

— At the beginning of the simulation when the initial values of the x_i’s are used
to compute the initial values of the a_ij’s and the b_i’s as described above;

— At the end of the simulation when the final values of the x_i’s are compared
with the second input tape to check whether they match those in β.

Thus, for each x_i, during the simulation of A but before the last reset of x_i, the
actual value of x_i is irrelevant. We describe how to construct a 2-tape PCA D
from C such that in the simulation of A, no nonstandard assignment is used.
For each clock x_i in A, there are two cases. The first case is when x_i will not
be reset during the entire simulation of C. The second case is when x_i will be
reset. D guesses the case for each x_i. For the first case, x_i is already
reversal-bounded, since the nonstandard assignment x_i := 0 is not used. For the second
case, D first decrements x_i to 0. Then D simulates C. Whenever a clock progress
x_i := x_i + 1 or a clock reset x_i := 0 is being executed by A, D keeps x_i as 0.
But, at some point when a clock reset x_i := 0 is being executed by A, D guesses
that this is the last clock reset for x_i. After this point, D faithfully simulates a
clock progress x_i := x_i + 1 executed by A, and a later execution of a clock reset
x_i := 0 in A will cause D to abort abnormally (since the guess of the last reset
of x_i was wrong). Thus D uses only the standard assignments x_i := x_i + 1 and
x_i := x_i, plus x_i := x_i − 1 initially to bring x_i to 0 (for the second case). □

From the above lemmas, we have:

Theorem 8. We can effectively construct, given a discrete timed automaton A
and a PCM M, a 2-tape PCA accepting Reach(A ⊕ M).
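
The last-reset guessing used in the proof of Lemma 2 can be illustrated for a single clock. In the sketch below the nondeterministic guess is made explicit as an argument, so each call corresponds to one branch of D; the operation names "tick", "reset", and "hold" and the function names are assumptions made for the example, not notation from the paper.

```python
from typing import List, Optional


def direct(initial: int, ops: List[str]) -> int:
    """Reference semantics of one clock, using the nonstandard reset."""
    x = initial
    for op in ops:
        if op == "tick":
            x += 1
        elif op == "reset":
            x = 0
        # "hold": some other clock was reset, so this clock keeps its value
    return x


def with_guess(initial: int, ops: List[str], last_reset: Optional[int]) -> Optional[int]:
    """One branch of D: returns the final value, or None if the branch aborts."""
    x = initial
    if last_reset is None:
        # Case 1: guess that the clock is never reset.
        for op in ops:
            if op == "reset":
                return None  # the guess was wrong
            if op == "tick":
                x += 1
        return x
    # Case 2: guess that position last_reset holds the last reset.
    if not 0 <= last_reset < len(ops) or ops[last_reset] != "reset":
        return None
    while x > 0:
        x -= 1  # standard decrements, done once before the simulation
    for pos, op in enumerate(ops):
        if pos <= last_reset:
            continue  # the value before the last reset is irrelevant; keep 0
        if op == "reset":
            return None  # a later reset contradicts the guess
        if op == "tick":
            x += 1
    return x


ops = ["tick", "reset", "tick", "tick", "reset", "tick", "hold", "tick"]
guesses: List[Optional[int]] = [None] + list(range(len(ops)))
surviving = [g for g in guesses if with_guess(3, ops, g) is not None]
print(surviving, with_guess(3, ops, surviving[0]), direct(3, ops))
```

In every branch the counter is decremented only in the initial phase and incremented only afterwards, so it changes direction at most once, and the single surviving branch ends with the same value as the direct simulation.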

One can generalize Theorem 8. Extend a PCA acceptor by allowing the
machine to have multiple pushdown stacks. Thus the machine will have multiple
reversal-bounded counters and multiple stacks (ordered by name, say S_1, ..., S_m).
The operation of the machine is restricted in that it can only read the topmost
symbol of the first nonempty stack. Thus a move of the machine would depend
only on the current state, the input symbol (or ε), the status of each counter
(zero or nonzero), and the topmost symbol of the first stack, say S_i, that is not
empty (initially, all stacks are set to some starting top symbol). The action taken
in the move consists of the input being consumed, each counter being updated
(+1, −1, 0), the topmost symbol of S_i being popped and a string (possibly empty)
being pushed onto each stack, and the next state being entered. This acceptor,
call it MPCA, was studied in |U()()| as a generalization of a PCA m and a 
generalization of a multipushdown acceptor |CijCCt)(I| . Thus an MPCA with 
only one stack reduces to a PCA. 

By combining the techniques in E7H1 and IKJBCC95I . it was shown in lUUUI 
that the emptiness problem for MPCAs is decidable. An MPCA without an 
input tape will be called an MPCM. By a construction similar to that of
Theorem 8, we can prove the next result. Note that checking that the contents of
the stacks at the end of the simulation are the same as the stack words in the 
target configuration does not require the latter to be in reverse (or need special 
handling), since we can first reverse the stack contents by using another set of 
pushdown stacks and then check that they match the stack words in the target 
configuration. 

Theorem 9. We can effectively construct, given a discrete timed automaton A 
and an MPCM M, a 2-tape MPCA accepting Reach(A ⊕ M).

Other examples of classes C that can be shown to have a decidable emptiness 
problem are given below. Thus, the results in Section 2 apply. 

1. Nondeterministic machines with reversal-bounded counters and a two-way
read/write worktape that is restricted in that the number of times the head
crosses the boundary between any two adjacent cells of the worktape is
bounded by a constant, independent of the computation (thus, the worktape
is finite-crossing). There is no bound on how long the head can remain on a
cell.

2. Nondeterministic machines with reversal-bounded counters and a queue 
that is restricted in that the number of alternations between non-deletion 
phase and non-insertion phase is bounded by a constant [IDS00]. A
non-deletion (non-insertion) phase is a period consisting of insertions (deletions)
and no-changes, i.e., the queue is idle. Without the restriction emptiness is 
undecidable since it is known that a finite-state machine with an unrestricted 
queue can simulate a Turing machine.

Finally, as mentioned in the paragraph preceding Theorem 7, we can provide the
machine A ⊕ M with an input tape. The language accepted by such an acceptor
can be shown to be accepted by an acceptor M' which belongs to the same class
as M (the simulation is similar to the one described in Lemmas 1 and 2). Thus,
Theorem 7 follows.

4 Applications 

In this section we exhibit some properties of timed automata that can be verified 
using the results above. 

Example 1. (Real-time) pushdown timed systems with “observation” counters
were studied in [BER95]. The purpose of these counters is to record information
about the evolution of the system and to reason about certain properties (e.g.,
number of occurrences of certain events in some computation). The counters do
not participate in the dynamics of the system, i.e., they are never tested by the
system. A transition edge specifies for each observation counter an integral value
(positive, negative, zero) to be added to the counter. Of interest are the values
of the counters when the system reaches a specified configuration. It was shown
in [BER95] that region reachability is decidable for these systems.

Clearly, for the discrete case, such a system can be simulated by the machine
A ⊕ M described in the previous section. We associate in M two counters with
each observation counter: one counter keeps track of the positive increases and
the other counter keeps track of the decreases. When the target
configuration is reached, the difference can be computed in one of the counters. Note
that the sign of the difference can be specified in another counter, which is set
to 0 for negative and 1 for positive. Thus, from Theorems 2-6, (binary, forward,
backward) reachability, safety, and invariance are solvable for these systems.
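
The two-counter encoding of an observation counter can be sketched as follows. The function name observe is hypothetical, and treating a total of zero as "positive" (sign flag 1) is an assumption, since the text only distinguishes negative from positive.

```python
from typing import Iterable, Tuple


def observe(deltas: Iterable[int]) -> Tuple[int, int]:
    """Return (sign, magnitude) of the sum of deltas; sign is 0 for negative, 1 otherwise."""
    pos = 0  # total of the increases; only incremented during the run
    neg = 0  # total of the decreases; only incremented during the run
    for d in deltas:
        for _ in range(abs(d)):  # an integral update is applied in unit steps
            if d > 0:
                pos += 1
            else:
                neg += 1
    # At the target configuration, cancel the two counters against each other.
    # Each counter goes up and then down, so it makes a single reversal.
    while pos > 0 and neg > 0:
        pos -= 1
        neg -= 1
    return (1, pos) if neg == 0 else (0, neg)


print(observe([3, -1, 2, -5]))
```

Splitting the counter this way matters because a single counter that is both increased and decreased along a run could reverse an unbounded number of times, which the reversal-bounded model does not allow.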

Example 2. Let A be a discrete timed automaton and M be a nondeterministic
pushdown machine with reversal-bounded counters. For a given computation of
A ⊕ M, let r_i be the number of times clock x_i resets. Suppose we are interested
in computations in which the r_i’s satisfy a Presburger formula f, i.e., we are
interested in (α, β) in Reach(A ⊕ M) such that α can reach β in a computation
in which the clock resets satisfy f. It is known that a set of k-tuples is definable
by a Presburger formula f if and only if it is definable by a reversal-bounded
multicounter machine. (Thus, a machine M_f with no input tape but with
reversal-bounded counters can be effectively constructed from f such that when
the values of the first k counters are set to the k-tuple and all the other counters
are initially zero, M_f enters an accepting state if and only if the k-tuple satisfies f.
In fact, M_f can be made deterministic.) It follows that we can construct
a 2-tape pushdown acceptor with reversal-bounded counters M' accepting the
set Q of pairs of configurations (α, β) in Reach(A ⊕ M) such that α can reach
β in a computation in which the clock resets satisfy f. One can also put other
constraints, like introducing a parameter t_i for each clock i, and consider
computations where the first time i resets to zero is before (or after) time t_i. We can
construct a 3-tape acceptor M'' from M' accepting Q(t_1, ..., t_k). M'' first reads
the parameters t_i’s (which are given on the third input tape) and then simulates
M', checking that the constraint on the first time clock i resets is satisfied. Note
that if M has no pushdown stack, then Q and Q(t_1, ..., t_k) are Presburger.



Example 3. As another example, suppose we are interested in the set S of pairs
of configurations (α, β) of a discrete timed automaton A such that there is a
computation path (i.e., sequence of states) from α to β that satisfies a property
that can be verified by an acceptor in a class C. If C has a decidable emptiness
problem, then S is effectively computable. For example, suppose that the
property is for the path to contain three non-overlapping subpaths (i.e., segments
of computation) which go through the same sequence of states, and the length
of each subpath is no less than 1/5 of the length of the entire path. Thus if p is
the computation path, there exist subpaths p_1, ..., p_7 (some may be null) such
that p = p_1 p_2 p_3 p_4 p_5 p_6 p_7, where p_2, p_4, and p_6 go through the same sequence
of states, and length of p_2 = length of p_4 = length of p_6 is no less than 1/5 of
the length of p. We can check this property by incorporating a finite-crossing
read-write tape to the machine (actually, the head need only make 5 crossings
on the read-write tape).
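
As a reference for what the property in Example 3 says, the following brute-force check tests a given path directly. It is only a specification-level sketch and not the finite-crossing tape construction; the function name has_three_repeats is hypothetical, and requiring the repeated subpath to be non-null is an assumption.

```python
from typing import Sequence


def has_three_repeats(path: Sequence[str]) -> bool:
    """True if path = p1 p2 p3 p4 p5 p6 p7 with p2 = p4 = p6 and 5 * len(p2) >= len(path)."""
    n = len(path)
    min_len = max(1, -(-n // 5))  # ceil(n / 5); the repeated subpath is assumed non-null
    for length in range(min_len, n // 3 + 1):
        for i in range(0, n - 3 * length + 1):
            segment = list(path[i:i + length])
            for j in range(i + length, n - 2 * length + 1):
                if list(path[j:j + length]) != segment:
                    continue
                for k in range(j + length, n - length + 1):
                    if list(path[k:k + length]) == segment:
                        return True
    return False


# States are written as single characters for brevity.
print(has_three_repeats("abXabYab"), has_three_repeats("abcdefgh"))
```

The search is polynomial in the path length but far from efficient; it is meant only to pin down the property that the machine with the finite-crossing tape verifies.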



Example 4. We can equip A ⊕ M with one-way write-only tapes which the
machine can use to record certain information about the computation of the system
(and perhaps even require that the strings appearing in these tapes satisfy
some properties). From Corollary 1, such systems can effectively be analyzed.

5 Reachability in Parallel Discrete Timed Automata 

The technique of using the reversal-bounded counters to record and compare 
various integers (like the running times of the machines) in the proofs in Section 3 
can be used to decide some reachability questions concerning machines operating 
in parallel. We give two examples below. 

Let A_1, A_2 be discrete timed automata and M_1, M_2 be PCMs. Recall from
Section 3 that a configuration of A_i ⊕ M_i is a 5-tuple α_i = (s_i, U_i, q_i, V_i, w_i).
Suppose we are given a pair of configurations (α_1, β_1) of A_1 ⊕ M_1 and a pair
of configurations (α_2, β_2) of A_2 ⊕ M_2, and we want to know if A_i ⊕ M_i when
started in configuration α_i can reach configuration β_i at some time t_i, with
t_1 and t_2 satisfying a given linear relation L(t_1, t_2) definable by a Presburger
formula. (Thus, e.g., if the linear relation is t_1 = t_2, then we want to determine
if A_1 ⊕ M_1 when started in configuration α_1 reaches β_1 at the same time that
A_2 ⊕ M_2 when started in α_2 reaches β_2.) This reachability question is decidable.
The idea is the following. First note that we can incorporate a counter in M_i
that records the running time t_i of A_i ⊕ M_i. Let Z_i be a 2-tape PCA accepting
Reach(A_i ⊕ M_i). We construct a 4-tape PCA Z which, when given α_1, β_1, α_2, β_2 on
its 4 tapes, first simulates the computation of Z_1 to check that α_1 can reach β_1,
recording the running time t_1 (which is in configuration β_1) of A_1 ⊕ M_1 in a
counter. Z then simulates Z_2. Finally, Z checks that the running times t_1 and t_2
satisfy the given linear relation (which can be verified since Presburger formulas
can be evaluated by nondeterministic reversal-bounded multicounter machines).
Since the emptiness problem for PCAs is decidable, decidability of reachability
follows.

We can allow the machines A_1 ⊕ M_1 and A_2 ⊕ M_2 to share a common
input tape, i.e., each machine has a one-way read-only input head (see the
paragraph preceding Theorem 7). A configuration α_i will now be a 6-tuple
α_i = (s_i, U_i, q_i, V_i, w_i, h_i), where h_i is the position of the input head on the
common input x. One can show that if both A_1 ⊕ M_1 and A_2 ⊕ M_2 have a one-turn stack
(or an unrestricted counter), then reachability is undecidable, even if they have
no reversal-bounded counters and the linear relation is t_1 = t_2. However, if only
one of A_1 ⊕ M_1 and A_2 ⊕ M_2 has an unrestricted pushdown stack, then
reachability is decidable. The idea is to construct a 5-tape PCA which, when given
α_1, β_1, α_2, β_2, x, simulates M_1 and M_2 in parallel on the input x, recording their
running times, and then checks that the linear relation is satisfied.

Note that the above results generalize to any number, k, of machines A_i ⊕ M_i
(i = 1, ..., k) operating in parallel.

6 Conclusions 

We showed that a discrete timed automaton augmented with a machine with 
reversal-bounded counters and possibly other data structures from a class C of 
machines can be effectively analyzed with respect to reachability, safety, and 
other properties if C has a decidable emptiness problem. We gave examples of 
such C’s and examples of new properties of discrete timed automata that can be 
verified. We also showed that reachability in parallel machines can be effectively 
decided. It would be interesting to look for other classes C with a decidable
emptiness problem.

Abstract. This article describes an algorithm for factorizing a finitely
ambiguous finite-state transducer (FST) into two FSTs, T1 and T2, such
that T1 is functional and T2 retains the ambiguity of the original FST.
The application of T2 to the output of T1 never leads to a state that
does not provide a transition for the next input symbol, and always
terminates in a final state. In other words, T2 contains no "failing paths"
whereas T1 in general does. Since T1 is functional, it can be factorized
into a left-sequential and a right-sequential FST that jointly constitute
a bimachine. The described factorization can accelerate the processing
of input because no failing paths are ever followed.



1 Introduction 

An ambiguous finite-state transducer (FST) returns for every accepted input 
string one or more output strings by following different alternative paths from 
the initial state to a final state. In addition, there may be a number of other 
paths that are followed from the initial state up to a certain point where they 
fail. Following these latter paths is necessary but at the same time inefficient. 

We present an algorithm for factorizing (decomposing) a finitely ambiguous¹
FST into two FSTs, T1 and T2, such that T1 is functional and T2 retains the
ambiguity of the original FST. We call T2 fail-safe, meaning that its application
to the output of T1 never leads to a state that does not provide a transition for
the next input symbol, and always terminates in a final state.

Because T1 is functional, it can be further factorized into a left-sequential²
and a right-sequential FST, T11 and T12, that jointly constitute a bimachine
as introduced by Schützenberger, using an existing factorization algorithm.
The resulting three FSTs, T11, T12, and T2, are used in a cascade that
simulates composition.

¹ Since infinite ambiguity, described by ε-loops, usually does not occur in practical
applications, the limitation of the algorithm to finitely ambiguous FSTs does not
constitute an obstacle in practice.

² The terms left-deterministic, left-sequential, etc. actually mean left-to-right-
deterministic, left-to-right-sequential, etc. Similarly, right-deterministic means
right-to-left-deterministic, etc.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 170-181, 2001.

© Springer-Verlag Berlin Heidelberg 2001
Another method for factorizing an ambiguous FST can be derived from a con-
struction on automata by Schützenberger, described more clearly by Sakarovitch
[Sec. 3] in the framework of the so-called covering of automata.³ This factor-
ization would yield a different result than the one described below.

Factorization of FSTs can be useful for many practical applications, e.g. in
Natural Language Processing, where FSTs are used for many basic steps. It
can accelerate the processing of input because no time is spent on failing paths,
and it allows the different parts of an FST (or of the described relation) to be
analyzed and manipulated separately.



1.1 Conventions 

Every FST has one initial state, labeled with number 0, and one or more final 
states marked by double circles. An arc with n labels designates a set of n arcs 
with one label each that all have the same source and destination. In a symbol 
pair occurring as an arc label, the first symbol is the input and the second the 
output symbol. For example, in the pair a:b, a is the input and b the output
symbol. Unpaired symbols represent identity pairs. For example, a means a:a.



2 Basic Idea 

An FST can contain a number of failing paths for a given input string. The FST
in Example 1 (Fig. 1) contains for the input string cabca two successful paths,
formed by the ordered arc sets ⟨101, 104, 108, 112, 115⟩ and ⟨101, 104, 109, 113,
115⟩ respectively, and three failing paths, ⟨100, 102, 105⟩, ⟨100, 102, 106⟩, and
⟨100, 103, 107⟩. For the string caba it has no successful and five failing paths,
⟨100, 102, 105⟩, ⟨100, 102, 106⟩, ⟨100, 103, 107⟩, ⟨101, 104, 108⟩, and ⟨101,
104, 109⟩. Following all failing paths is inevitable but inefficient.
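
The cost of failing paths can be made concrete with a small sketch. The FST below is hypothetical (it is not the transducer of Fig. 1, whose arcs are not reproduced here), and the function name `explore` is an assumption of this illustration. It follows every path for an input string and counts the paths that either get stuck or end in a non-final state.

```python
from collections import defaultdict

# Hypothetical FST (not the one of Fig. 1): arcs are
# (source, input symbol, output symbol, destination).
ARCS = [
    (0, "a", "x", 1), (0, "a", "y", 2), (0, "a", "z", 3),
    (1, "b", "x", 4), (2, "b", "y", 4), (3, "c", "z", 4),
]
FINALS = {4}

def explore(arcs, finals, text, start=0):
    """Follow every path for `text`; return (sorted outputs, failing path count)."""
    by_source = defaultdict(list)
    for src, inp, out, dst in arcs:
        by_source[(src, inp)].append((out, dst))

    outputs, failures = [], 0
    stack = [(start, 0, "")]          # (state, position in text, output so far)
    while stack:
        state, pos, out = stack.pop()
        if pos == len(text):
            if state in finals:
                outputs.append(out)   # successful path
            else:
                failures += 1         # input consumed in a non-final state
            continue
        successors = by_source.get((state, text[pos]), [])
        if not successors:
            failures += 1             # no transition for the next input symbol
        for sym, dst in successors:
            stack.append((dst, pos + 1, out + sym))
    return sorted(outputs), failures

print(explore(ARCS, FINALS, "ab"))    # two successful paths, one failing path
```

In this toy FST every input starting with a forces three paths to be opened, although at most two of them can succeed; the factorization described below aims to avoid exactly this kind of wasted work in the ambiguous factor.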




Any ambiguous ε-free FST T can be factorized into two FSTs, T1 and T2,
such that T1 is unambiguous and T2 is fail-safe wrt. the output of T1 (Fig. 2).

³ Many thanks to Jacques Sakarovitch (CNRS and ENST, Paris) for pointing me to
this work and for explaining how Schützenberger's construction can be made
the principal step in the factorization of ambiguous FSTs.
Because of its structure, T2 is called a flower transducer. Informally speaking,
the factorization algorithm collapses a set of alternative sub-paths of T into one
single sub-path in T1, and expands it again in T2. When applied to an input
string, T1 and T2 operate as a cascade: T1 maps the input string to (at most)
one intermediate string, and T2 maps that string to a set of alternative output
strings.

    cabba → xψ0bbx → { xxxxx, xxyyx, xyzyx }
    cabca → yzψ1cy → { yzxxy, yzyyy }

Fig. 2. Factorization of T into (a) a functional T1 and (b) an ambiguous fail-safe flower
transducer T2 (Example 1)



The FST in Example 1 contains two ambiguity fields (Fig. 1). An ambiguity
field is a maximal set of alternative subpaths that all accept the same substring
in the same position of the same input strings. The first ambiguity field in Ex-
ample 1 spans from state 1 to 10, and maps the substring abb of the input string
cabba to the set of alternative output substrings {xxx, xyy, yzy}. In T1 this
ambiguity field is collapsed into a single subpath ranging from state 1 to 7 that
maps the substring abb to ψ0bb (Fig. 2a). T2 maps this intermediate substring to
the set of output substrings {xxx, xyy, yzy} by following the alternative subpaths
⟨102, 105, 107⟩, ⟨102, 104, 106⟩, and ⟨101, 103, 106⟩ respectively (Fig. 2b).
The second ambiguity field of Example 1 spans from state 5 to 11, and maps
the substring bc of the input string cabca to the set of output substrings {xx,
yy} (Fig. 1). In T1 this ambiguity field is collapsed into a single subpath ranging
from state 4 to 8 that maps bc to ψ1c (Fig. 2a). T2 maps the latter substring to
the output substrings {xx, yy} by following the subpaths ⟨108, 110⟩ and ⟨109,
111⟩ respectively (Fig. 2b).

Note that in T1 a diacritic is used only on the first arc of a collapsed ambiguity
field, and that the other arcs of the ambiguity field (usually) simply map an input
symbol to itself. All symbols that are accepted outside an ambiguity field are
mapped in T1 to their final output, which is then mapped to itself in T2 by an arc
that loops on the initial state (Fig. 2). In the current example this loop consists
of the arc 100, which is actually a set of three looping arcs with one symbol each
(Fig. 2b).
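
The behaviour of the flower transducer T2 on the two intermediate strings quoted above can be imitated with a short sketch. It is a simplification, not the automaton of Fig. 2b: each "petal" is stored as a whole intermediate substring with its alternative outputs instead of as individual arcs, and the looping symbols x, y, z as well as the names `IDENTITY`, `PETALS` and `apply_t2` are assumptions of this illustration.

```python
# Simplified model of the flower transducer T2 of Example 1.
IDENTITY = {"x", "y", "z"}                       # assumed looping arcs on the initial state
PETALS = {
    ("ψ0", "b", "b"): ["xxx", "xyy", "yzy"],     # first ambiguity field
    ("ψ1", "c"): ["xx", "yy"],                   # second ambiguity field
}

def apply_t2(symbols):
    """Expand a sequence of intermediate symbols into all alternative outputs."""
    results = [""]
    i = 0
    while i < len(symbols):
        if symbols[i] in IDENTITY:
            # Symbol outside any ambiguity field: mapped to itself.
            results = [r + symbols[i] for r in results]
            i += 1
            continue
        for petal, alternatives in PETALS.items():
            if tuple(symbols[i:i + len(petal)]) == petal:
                # Expand the collapsed sub-path into its alternatives.
                results = [r + alt for r in results for alt in alternatives]
                i += len(petal)
                break
        else:
            # Cannot happen on genuine output of T1, since T2 is fail-safe.
            raise ValueError(f"unexpected symbol at position {i}: {symbols[i]!r}")
    return sorted(results)

print(apply_t2(["x", "ψ0", "b", "b", "x"]))
print(apply_t2(["y", "z", "ψ1", "c", "y"]))
```

The diacritic alone selects the petal, so the expansion never has to backtrack; this is the sense in which T2 follows no failing paths.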

T1, which is functional but not sequential, can be further factorized into a
left-sequential and a right-sequential FST, T11 and T12, that jointly constitute a
bimachine. The three FSTs, T11, T12, and T2, together represent a factorization
of T. The factorization from Example 1 is shown in Figure 3. When applied
to an input string, the three FSTs operate as a cascade: T11 maps, e.g., the
input string cabca, deterministically from left to right, to the intermediate string
cabcoi (Fig. 3a). T12 then maps this string, deterministically from right to left,
to yzψ1cy (Fig. 3b). Finally, T2 maps that string, from left to right, to the set
of alternative output strings {yzxxy, yzyyy} (Fig. 3c). In such a cascade, T11
and T12 are sequential, and T12 and T2 are fail-safe wrt. the output of their
predecessor. Input strings that are not accepted fail in the first FST, T11, on
one single path, and require no further attention.

    xψ0bbx → { xxxxx, xxyyx, xyzyx }
    yzψ1cy → { yzxxy, yzyyy }

Fig. 3. Factorization of T into (a) a left-sequential T11, (b) a right-sequential fail-safe
T12, and (c) an ambiguous fail-safe T2 (Example 1)

3 Factorization Algorithm 

3.1 Starting Point 

The factorization of the ambiguous ε-free FST in Example 2 (Fig. 4) requires
identifying maximal sets of alternative arcs that must be collapsed in T1 and
expanded again in T2. Two arcs are alternative wrt. each other if they are situated
at the same position on two alternative paths that accept the same input string.
This means the two arcs must have the same input symbol and equal sets of input
prefixes and input suffixes. The two arcs 105 and 106 in Example 2 constitute
such a maximal set of alternative arcs (Fig. 4). Both arcs have the input symbol
b, the input prefix set {a*ab}, and the input suffix set {ca, cb, cc}. Two arcs
are not alternative wrt. each other and must not be collapsed if they have either
different input symbols, or no prefixes or no suffixes in common.

In general, an FST can contain arcs for which neither of these two premises holds.
In Example 2 the arcs 103 and 104 have identical input symbols, b, and equal
input prefix sets, {a*a}, but their input suffix sets, {ε, bca, bcb, bcc} and {bca,
bcb, bcc} respectively, are neither equal nor disjoint (Fig. 4). These two arcs are
only partially alternative, which means they would have to be collapsed and not
collapsed at the same time. To resolve this dilemma, the FST must be transformed
(pre-processed) such that the sets of input prefixes and input suffixes of all arcs
become either equal or disjoint, without changing the relation described by the
FST.

    a*ab      →  { x*yx, x*zy }
    a*abbca   →  { x*yxxxx, x*yyyyx }
    a*abbcb   →  { x*yxxyy, x*yyyxy }
    a*abbcc   →  { x*yxxyz, x*yyyxz }
    a*ac      →  { x*zz }

Fig. 4. Ambiguous ε-free FST T (Example 2)



3.2 Pre-processing 

The first step of the pre-processing consists of concatenating the FST on both
sides with boundary symbols, and minimizing the result by means of stan-
dard algorithms (Fig. 5). This operation "transfers" the properties of ini-
tiality and finality from states to special arcs. Therefore, these properties will
not require any attention in some subsequent operations. The result of the first
pre-processing step will be referred to as minimal FST T^m.




The second step of the pre-processing consists of a left-unfolding of T^m,
which means that every state q^m of T^m is split into a set of children states q_i.
Each prefix of q^m is inherited by only one q_i, and each suffix by all of them.
Consequently, different q_i of the same q^m have disjoint prefix sets and equal
suffix sets (Fig. 6).

The operation is based on the left-deterministic input automaton A^L of T^m,
which is obtained by extracting the input side from T^m, and determinizing it
from left to right (Fig. 6a). Every state of A^L corresponds to a set of states of
T^m, and is assigned the set of corresponding state numbers (Fig. 5, 6a). Every
state of T^m is copied to the left-unfolded FST T^L (Fig. 6b) as many times as it
occurs in different state sets of A^L. (The copying of the arcs is described later.)
For example, state 8 of T^m occurs in the state sets of both state 2 and 5 of
A^L, and is therefore copied twice to T^L, where the two copies have the state
numbers 9 and 10.

Every state q of T^L corresponds to one state q^m of T^m and to one state q^L
of A^L. In the left-unfolded T^L of Example 2 every state is labeled with a triple
of state numbers (q, q^m, q^L) (Fig. 6b). For example, states 9 and 10 are labeled
with the triples (9, 8, 5) and (10, 8, 2) respectively, which means that they are
both copies of state 8 (= q^m) of T^m but correspond to different states of A^L,
namely to the states 5 and 2 (= q^L) respectively.

Fig. 6. (a) Left-deterministic input automaton A^L, built from T^m, and (b) left-
unfolded FST T^L (Example 2)

Every state q of T^L inherits the full set of outgoing arcs of the corresponding
state q^m of T^m. For example, the set of outgoing arcs {101, 102, 103} of state 1
(= q^m) of T^m is inherited by both state 1 and 2 (= q) of T^L, where it becomes
{102, 101, 103} and {105, 104, 106} respectively (Fig. 5, 6b). The destination
of every arc of T^L is determined by the destination states q^m and q^L of the
corresponding arcs in T^m and A^L. For example, the arc 102 of T^L must point
to state 2, labeled with (2,1,2), because this (and only this) state corresponds
to both the destination q^m = 1 of the corresponding arc 101 in T^m and the
destination q^L = 2 of the corresponding arc 101 in A^L.

The left-unfolded T^L describes the same relation as T^m. Minimizing T^L
would generate T^m.
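
The left-unfolding step can be sketched as follows, under simplifying assumptions: the FST is ε-free, it is given as a plain list of arcs, and the boundary symbols and the minimization of the first pre-processing step are omitted. The example FST and the names `left_unfold`, `subsets` and `unfolded` are hypothetical. The left-deterministic input automaton is obtained by the usual subset construction on the input side; every FST state is then copied once per subset that contains it.

```python
# Hypothetical epsilon-free FST: arcs are (source, input, output, destination).
# State 2 is reached by the prefixes "a" and "b", state 1 only by "a".
ARCS = [
    (0, "a", "x", 1), (0, "a", "y", 2), (0, "b", "z", 2),
    (1, "c", "x", 3), (2, "c", "y", 3),
]

def left_unfold(arcs, initial=0):
    """Return (subsets, unfolded_arcs).

    subsets[k] is the set of FST states represented by state k of the
    left-deterministic input automaton; unfolded states are pairs
    (FST state, k), one copy of the FST state per subset containing it.
    """
    start = frozenset([initial])
    index = {start: 0}
    subsets = [start]
    delta = {}                                    # transitions of the input automaton
    queue = [start]
    while queue:                                  # subset construction on the input side
        current = queue.pop(0)
        symbols = {inp for src, inp, _, _ in arcs if src in current}
        for sym in sorted(symbols):
            target = frozenset(d for s, i, _, d in arcs if s in current and i == sym)
            if target not in index:
                index[target] = len(subsets)
                subsets.append(target)
                queue.append(target)
            delta[(index[current], sym)] = index[target]

    unfolded = []
    for k, subset in enumerate(subsets):
        for src, inp, out, dst in arcs:
            if src in subset:
                # The destination copy is fixed by both the FST arc and
                # the corresponding arc of the input automaton.
                unfolded.append(((src, k), inp, out, (dst, delta[(k, inp)])))
    return subsets, unfolded

subsets, unfolded = left_unfold(ARCS)
for arc in unfolded:
    print(arc)
```

In this toy example state 2 is split into the copies (2, 1) and (2, 2), reached by the prefixes a and b respectively, so the prefix sets of the copies are disjoint while each copy keeps the full set of outgoing arcs of the original state.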

The third step of the pre-processing consists of a right-unfolding of the
previously left-unfolded T^L, which means that every state q of T^L is split into
a set of children states q_i. Each prefix of q is inherited by all q_i, and each suffix
by only one of them. Consequently, different q_i of the same q have equal prefix
sets and disjoint suffix sets (Fig. 7).

The operation is based on the right-deterministic input automaton A^R of the
previously left-unfolded T^L, and is performed exactly as the second step, except
that T^L is reversed before the operation, and reversed back afterwards. The
reversal consists of making the initial state final and the only final state initial,
and changing the direction of all arcs, without minimization or determinization
that would change the structure of the FST.

Every state q of the fully (i.e. left- and right-) unfolded FST (Fig. 7b)
corresponds to one state q^m of T^m, to one state q^L of A^L, and to one state q^R of
A^R. In the fully unfolded FST of Example 2 every state is labeled with a quadru-
ple of state numbers (q, q^m, q^L, q^R) (Fig. 7b). For example, the states 11, 12, 13,
and 14 are labeled with the quadruples (11,8,5,2), (12,8,5,4), (13,8,2,4), and
(14,8,2,2), which means that they are all copies of state 8 (= q^m) of T^m but
correspond to different states of A^L and A^R.

Fig. 7. (a) Right-deterministic input automaton A^R, built from the left-unfolded FST
T^L, and (b) fully (i.e. left- and right-) unfolded FST (Example 2)

Every state q of the fully unfolded FST has the same input prefix set P^in(q) as
the corresponding state q^L of A^L and the same input suffix set S^in(q) as the
corresponding state q^R of A^R:

    \forall q \in Q : \quad P^{in}(q) = P^{in}(q^L)        (1)
                      \quad S^{in}(q) = S^{in}(q^R)        (2)

Consequently, two states, q_i and q_j, of the fully unfolded FST have equal input
prefix sets iff they correspond to the same state q^L, and equal input suffix sets
iff they correspond to the same state q^R:

    \forall q_i, q_j \in Q : \quad P^{in}(q_i) = P^{in}(q_j) \Leftrightarrow q_i^L = q_j^L        (3)
                             \quad S^{in}(q_i) = S^{in}(q_j) \Leftrightarrow q_i^R = q_j^R        (4)

The input prefix and suffix sets of the states of the fully unfolded FST are either
equal or disjoint. Partial overlaps cannot occur.

Equivalent states of the fully unfolded FST are different copies of the same state
q^m of T^m. This means, two states, q_i and q_j, are equivalent iff they correspond
to the same state q^m of T^m:

    q_i \equiv q_j \;:\Leftrightarrow\; q_i^m = q_j^m        (5)

Every arc a of the fully unfolded FST can be described by a quadruple:

    a = (s, d, \sigma^{in}, \sigma^{out})
    with a \in A; \; s, d \in Q; \; \sigma^{in} \in \Sigma^{in}; \; \sigma^{out} \in \Sigma^{out}        (6)

where s and d are the source and destination state, and σ^in and σ^out the input
and output symbol of the arc a respectively. For example, the arc 102 of the fully
unfolded FST can be described by the quadruple (1, 4, a, y) (Fig. 7b).

Alternative arcs describe alternative transductions of the same input symbol
in the same position of the same input string. Two arcs, a_i and a_j, are alternative
wrt. each other iff they have the same input symbol and equal input prefix and
suffix sets. The input prefix set of an arc is the input prefix set of its source
state, and the input suffix set of an arc is the input suffix set of its destination
state:

    a_i \overset{alt}{\equiv} a_j \;:\Leftrightarrow\; (\sigma_i^{in} = \sigma_j^{in}) \wedge (P^{in}(s_i) = P^{in}(s_j)) \wedge (S^{in}(d_i) = S^{in}(d_j))        (7)

Equivalent arcs are different copies of the same arc of T^m. Two arcs are
equivalent iff they have the same input and output symbol, and equivalent source
and destination states:

    a_i \equiv a_j \;:\Leftrightarrow\; (\sigma_i^{in} = \sigma_j^{in}) \wedge (\sigma_i^{out} = \sigma_j^{out}) \wedge (s_i \equiv s_j) \wedge (d_i \equiv d_j)        (8)

Two equivalent arcs are also alternative wrt. each other but not vice versa.

The fully unfolded FST describes the same relation as T^m (Fig. 5). Mini-
mizing it would generate T^m. The previous dilemma of collapsing partially
alternative arcs does not occur in the fully unfolded FST, where arcs are never
partially alternative wrt. each other.



3.3 Construction of Factors 

After the pre-processing, preliminary factors, T1' and T2', are built (Fig. 8): First,
all states of the unfolded FST are copied (as they are) to both T1' and T2'. Then,
all arcs of the unfolded FST are grouped into disjoint maximal sets A of alternative
arcs. For the current example (Fig. 7b), the sets are:

{100}, {101, 105}, {102}, {103}, {104}, {106, 110}, {107}, {108}, {109}, 

{111, 127}, {112, 113}, {114, 129}, {115, 116}, {117, 120}, {118, 121}, 

{119, 122}, {123}, {124}, {125}, {126}, {128} 
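
Because the input prefix and suffix sets of unfolded states are identified by the corresponding states of the two input automata, grouping the arcs into maximal sets of alternative arcs reduces to grouping by a key. The sketch below assumes that each arc is given as a quadruple (source, destination, input symbol, output symbol) and that two dictionaries map every unfolded state to its state in the left- and right-deterministic input automaton. The arc numbers and state labels are hypothetical and do not reproduce Fig. 7b.

```python
from collections import defaultdict

def group_alternative_arcs(arcs, q_left, q_right):
    """Group arcs that share input symbol, source prefix class and destination suffix class."""
    groups = defaultdict(list)
    for name, (src, dst, sym_in, _sym_out) in arcs.items():
        # Equal prefix sets <=> same left-automaton state of the source;
        # equal suffix sets <=> same right-automaton state of the destination.
        groups[(sym_in, q_left[src], q_right[dst])].append(name)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical unfolded FST with five arcs.
ARCS = {
    100: (0, 1, "a", "x"),
    101: (0, 2, "a", "y"),
    102: (1, 3, "b", "x"),
    103: (2, 3, "b", "y"),
    104: (3, 4, "c", "z"),
}
Q_LEFT = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
Q_RIGHT = {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}

print(group_alternative_arcs(ARCS, Q_LEFT, Q_RIGHT))
```

The resulting groups are disjoint by construction, since every arc has exactly one key.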

Sets A of alternative arcs can have the following different locations wrt.
ambiguity fields (cf. Sec. 2):

• Singleton sets (e.g., {100} or {102} in Fig. 7b) and sets where all arcs are
equivalent wrt. each other (no example in Fig. 7b) do not describe an ambi-
guity. These arc sets are outside any ambiguity field.

• All other arc sets A describe an ambiguity (e.g., {115, 116}). They are inside
an ambiguity field where three different (possibly co-occurring) locations can
be distinguished:

o A is at the beginning of an ambiguity field iff the source states of all arcs in
A are equivalent (e.g., {101, 105} and {112, 113}):

    Begin(A) \;:\Leftrightarrow\; \forall a_i, a_j \in A : s_i \equiv s_j        (9)

o A is at the end of an ambiguity field iff the destination states of all arcs in
A are equivalent (e.g., {117, 120} and {114, 129}):

    End(A) \;:\Leftrightarrow\; \forall a_i, a_j \in A : d_i \equiv d_j        (10)

o A is at an ambiguity fork, i.e., at a position where two or more ambiguity
fields with a common (overlapping) beginning separate from each other, iff
there is an arc a_i in A and an arc a_k outside A so that both have the same
input symbol and equivalent source states but disjoint input suffix sets. This
means that the state q^m of T^m that corresponds to the source states of both
arcs can be left via either arc, a_i or a_k, but one of the arcs is on a failing
path, and therefore should not be taken (e.g., {117, 120} and {118, 121}):

    Fork(A) \;:\Leftrightarrow\; \exists a_i \in A, \exists a_k \notin A : (\sigma_i^{in} = \sigma_k^{in}) \wedge (s_i \equiv s_k) \wedge (S^{in}(d_i) \neq S^{in}(d_k))        (11)
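
The three location tests can be written down directly from their definitions. The sketch below is an illustration under assumptions: arcs are quadruples (source, destination, input symbol, output symbol), `q_min` maps each unfolded state to its state in the minimal FST (so equal values mean equivalent states), and `q_right` maps each state to its state in the right-deterministic input automaton (so unequal values mean disjoint suffix sets). All arc numbers and state labels are hypothetical.

```python
def begins(A, arcs, q_min):
    """Begin(A): the source states of all arcs in A are equivalent."""
    return len({q_min[arcs[a][0]] for a in A}) == 1

def ends(A, arcs, q_min):
    """End(A): the destination states of all arcs in A are equivalent."""
    return len({q_min[arcs[a][1]] for a in A}) == 1

def forks(A, arcs, q_min, q_right):
    """Fork(A): some arc outside A has the same input symbol and an equivalent
    source state as an arc in A, but a different (hence disjoint) suffix set."""
    for a in A:
        s, d, sym, _ = arcs[a]
        for k, (sk, dk, symk, _) in arcs.items():
            if k in A:
                continue
            if symk == sym and q_min[sk] == q_min[s] and q_right[dk] != q_right[d]:
                return True
    return False

# Hypothetical data: states 1 and 2 are copies of the same minimal state.
ARCS = {
    110: (1, 3, "b", "x"),
    111: (2, 4, "b", "y"),
    112: (1, 5, "b", "z"),   # same symbol and source as 110, other suffix class
    113: (3, 6, "c", "x"),
    114: (4, 6, "c", "y"),
}
Q_MIN = {1: 1, 2: 1, 3: 3, 4: 4, 5: 5, 6: 6}
Q_RIGHT = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 3}

A, B = {110, 111}, {113, 114}
print(begins(A, ARCS, Q_MIN), ends(A, ARCS, Q_MIN), forks(A, ARCS, Q_MIN, Q_RIGHT))
print(begins(B, ARCS, Q_MIN), ends(B, ARCS, Q_MIN), forks(B, ARCS, Q_MIN, Q_RIGHT))
```

Here the set A opens an ambiguity field and sits at a fork (arc 112 competes with arc 110), while the set B closes the field.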





Fig. 8. Preliminary (non-minimal) factors, being (a) a functional FST T1' and (b) an
ambiguous fail-safe FST T2' (Example 2)
Every arc of the unfolded FST is represented in both T1' and T2'. Arcs that
are outside any ambiguity field are copied to T1' as they are wrt. their labels
and source and destination states (Fig. 8a). In T2' they are represented by an arc
looping on the initial state and labeled with the output symbol of the original arc
(Fig. 8b). This means that these unambiguous transductions of symbols are performed
by T1', and T2' simply accepts the output symbols by means of looping arcs. For
example, arc 102 of the unfolded FST, labeled with a:y, is copied to T1' as it is,
and a looping arc 100 labeled with y is created in T2'.

All arcs of a set A that is inside an ambiguity field are copied to both T1'
and T2' with their original location (regarding their source and destination) but
with modified labels (Fig. 8). They are copied to T1' with their common original
input symbol σ^in and a common intermediate symbol σ^mid (as output), and to
T2' with this intermediate symbol σ^mid (as input) and their different original
output symbols σ^out. This causes the copy of the arc set A to collapse into one
single arc after the minimization of T1'. The common intermediate symbol of all
arcs in A can be a diacritic that is unique within the whole FST, i.e., that is not
used for any other arc set.

If there is concern about the size of T1 and T2 and their alphabets, diacritics
should be used sparingly. In this case, the choice of a common σ^mid for a set A
depends on the location of A wrt. an ambiguity field:

• At the beginning of an ambiguity field, the common intermediate symbol σ^mid
is a diacritic that must be unique within the whole FST. For example, the arc
set {112, 113} of the unfolded FST gets the diacritic ψ2, i.e., the arcs change
their labels from A = {b:x, b:y} to A1 = {b:ψ2, b:ψ2} in T1' and to
A2 = {ψ2:x, ψ2:y} in T2'. In addition, an ε-arc is inserted from the initial state
of T2' to the source state of every arc in A, which causes the ambiguity field to
begin at the initial state after minimization of T2'.

• At a fork position that does not coincide with the beginning of an ambiguity
field, the common σ^mid is a diacritic that needs to be unique only among all
arc sets that have the same input symbol and the same input prefix set. This
diacritic can be re-used with other forks. For example, the arc set {117, 120}
gets the diacritic φ0, i.e., the arcs change their labels from A = {c:x, c:y} to
A1 = {c:φ0, c:φ0} in T1' and to A2 = {φ0:x, φ0:y} in T2'.

• In all other positions inside an ambiguity field, the common σ^mid equals the
common input symbol σ^in of all arcs in a set. For example, the arc set {115,
116} gets the intermediate symbol b, i.e., the arcs change their labels from
A = {b:x, b:y} to A1 = {b, b} in T1' and keep their labels in T2', i.e., A2 = A.

• At the end of an ambiguity field, one of the above rules for intermediate sym-
bols is applied. In addition, an ε-arc is inserted in T2' from the destination
state of every arc in A to the final (= initial) state of T2', which causes the
ambiguity field to end at the final state after minimization of T2'.
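
The relabelling of one set A of alternative arcs can be illustrated in a few lines. The sketch only rewrites labels; the insertion of ε-arcs and the minimization are not modelled, and the function name `split_labels` is an assumption of this illustration. Labels are (input, output) pairs.

```python
def split_labels(labels, intermediate):
    """Split the labels of one set A into its copies for T1' and T2'.

    labels: list of (input, output) pairs; all inputs must be equal because
    the arcs are alternative wrt. each other.
    """
    if len({inp for inp, _ in labels}) != 1:
        raise ValueError("alternative arcs must share one input symbol")
    a1 = [(inp, intermediate) for inp, _ in labels]   # input : intermediate
    a2 = [(intermediate, out) for _, out in labels]   # intermediate : output
    return a1, a2

# Beginning of an ambiguity field: a unique diacritic is used.
print(split_labels([("b", "x"), ("b", "y")], "ψ2"))
# Inside an ambiguity field: the input symbol itself serves as intermediate symbol.
print(split_labels([("b", "x"), ("b", "y")], "b"))
```

In the second call the T1' copy consists of identity pairs b:b, which the text writes as {b, b}, and the T2' copy equals the original set A.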

The final factors, T1 and T2, are obtained by replacing all boundary symbols
with ε, and minimizing the preliminary factors, T1' and T2' (Fig. 8, 9). T1
performs a functional transduction of every accepted input string by mapping
every substring outside an ambiguity field to the corresponding unambiguous
output, and every substring inside an ambiguity field to a unique intermediate
substring. T2 maps the former substring to itself, and the latter to a set of
alternative outputs.

    ab        →  ψ0b          →  { yx, zy }
    a*aab     →  x*xψ1b       →  { x*xyx, x*xzy }
    a*abbca   →  x*yψ2bφ0x    →  { x*yxxxx, x*yyyyx }
    a*abbcb   →  x*yψ2bφ1y    →  { x*yxxyy, x*yyyxy }
    a*abbcc   →  x*yψ2bφ2z    →  { x*yxxyz, x*yyyxz }
    a*ac      →  x*zz         →  { x*zz }

Fig. 9. Final (minimal) factors, being (a) a functional FST T1 and
(b) an ambiguous fail-safe FST T2 (Example 2)



4 Final Remarks 

An FST can contain arcs with ε (the empty string) on the input side, which is an
obstacle for the above factorization. Input ε's can be removed by removing the
ε-arcs and concatenating their output symbols with the output of adjacent non-
ε-arcs. This classical method, however, cannot be applied to FSTs that accept
ε as input, mapping it to a non-empty string, or contain ε-loops. An ε on the
output side can be handled like an ordinary symbol in factorization.

If an FST contains arcs for the unknown symbol, denoted by "?", in a location
where a diacritic is required, factorization cannot be performed as described. For
example, the arc set A = {?, ?:x} cannot be factorized into A1 = {?:ψ1, ?:ψ1}
and A2 = {ψ1:?, ψ1:x}. The first arc in A must map a given unknown symbol to
itself, which is not possible when it is factorized. The first arc in A1 would map
an unknown symbol to ψ1; the first arc in A2, however, could not map ψ1 to
the same unknown symbol that occurred in the input, without using additional
memory (and a special mechanism) at runtime. To solve this conflict, A can
be factorized into two sets of arc sequences. For example, A = {?, ?:x} can be
factorized into A1 = {⟨ε:ψ1, ?⟩, ⟨ε:ψ1, ?⟩} and A2 = {⟨ψ1:ε, ?⟩, ⟨ψ1:ε, ?:x⟩}.
An auxiliary state is added inside every arc sequence.

The above factorization can create some redundant intermediate diacritics.
A diacritic that always "co-occurs" in T2 with another diacritic can be replaced
in T1 and T2 by the latter without affecting the overall relation. This reduces
the size of the intermediate alphabet and, after minimization, the size of T2. By
co-occurrence of two or more diacritics we mean that the arcs that are labeled
on the input side with these diacritics have the same source, destination, and
output symbol. In Example 2 (Fig. 9), ψ1 can be replaced by ψ0, and φ2 by φ1.

The algorithm described in this article has been implemented. Future re- 
search will include an experimental evaluation of the efficiency gain when pro- 
cessing input strings. 

Abstract. The Mona tool provides an implementation of the decision
procedures for the logics WS1S and WS2S. It has been used for numer-
ous applications, and it is remarkably efficient in practice, even though it
faces a theoretically non-elementary worst-case complexity. The imple-
mentation has matured over a period of six years. Compared to the first
naive version, the present tool is faster by several orders of magnitude.
This speedup is obtained from many different contributions working on
all levels of the compilation and execution of formulas. We present a se-
lection of implementation "secrets" that have been discovered and tested
over the years, including formula reductions, DAGification, guided tree
automata, three-valued logic, eager minimization, BDD-based automata
representations, and cache-conscious data structures. We describe these
techniques and quantify their respective effects by experimenting with
separate versions of the Mona tool that in turn omit each of them.



1 Introduction 

Mona is an implementation of the decision procedures for the logics
WS1S and WS2S. They have long been known to be decidable, but
with a non-elementary lower bound. For many years it was assumed that
this discouraging complexity precluded any useful implementations.

However, Mona has been developed at BRICS since 1994, when our initial
attempt at automatic pointer analysis through automata calculations took four 
hours to complete. Today Mona has matured into an efficient and popular tool 
on which the same analysis is performed in a couple of seconds. Through those 
years, many different approaches have been tried out, and a good number of 
implementation “secrets” have been discovered. This paper describes the most 
important tricks we have learned, and it tries to quantify their relative merits 
on a number of benchmark formulas. 

Of course, the resulting tool still has a non-elementary worst-case complexity. 
Perhaps surprisingly, this complexity also contributes to successful applications, 
since it is provably linked to the succinctness of the logics. If we want to describe 
a particular regular set, then a WS1S formula may be non-elementarily more
succinct than a regular expression or a transition table.

The niche for Mona applications contains those structures that are too large
and complicated to describe by other means, yet not so large as to require in-
feasible computations. Happily, many interesting projects fit into this niche,
including hardware verification, pointer analysis, controller synthesis, natural
languages, parsing tools, Presburger arithmetic, and verification of concurrent
systems.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 182-194, 2001.






2 MONA, WS1S, and WS2S



The first versions of Mona were based on a logic about finite strings, the
monadic second-order logic M2L(Str). In this notation, first-order variables are
interpreted over the positions in a finite string. Thus, for a given interpretation,
there is a maximum value that a first-order variable may take on. A second-
order variable denotes a subset of positions. A formula is valid if it holds for any
finite string. The decision procedure for this logic is slightly easier to implement
than that of the current Mona, which is based on WS1S. This logic is simpler
to explain: a first-order variable denotes a natural number, and a second-order
variable denotes a finite set of numbers. Both logics allow the comparison of
variables in the expected ways depending on their order: <, ⊆, =, ∈, etc. Also,
a function symbol +1 is allowed on first-order terms. It denotes the successor
(where in the case of M2L(Str), the successor of the last position is defined as
the first position).

Mona additionally supports the logic WS2S with two successors. Also, there
is explicit syntax for Presburger constants. Finally, it implements the variation
WSRT, which allows values of recursive data types rather than simply binary
trees. The Mona manual describes the syntax and semantics of the Mona
language and the features of the tool.

The automaton for a formula is constructed recursively from automata rep-
resenting subformulas. In the cases of M2L(Str) and WS1S, each automaton
describes a language of strings over the alphabet {0, 1}^k, where k is the num-
ber of free variables in the subformula. Each string represents an interpretation,
that is, an assignment of values to variables that are free in the subformula; the
language is the set of strings that define satisfying interpretations. For WS2S
this is generalized to tree automata.
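
The representation of an interpretation as a string over {0, 1}^k can be illustrated with a small sketch. It assumes the usual convention, which the text does not spell out, that each variable owns one track, that a second-order variable is encoded by the characteristic bits of its set, and that a first-order variable is encoded like the singleton set containing its value. The function names are hypothetical and are not part of the Mona tool.

```python
def encode(assignment, order):
    """Encode an interpretation as a list of letters from {0,1}^k.

    assignment maps a variable name to an int (first-order value) or to a
    finite set of ints (second-order value); `order` fixes the track of
    each variable.
    """
    tracks = {}
    for var in order:
        value = assignment[var]
        tracks[var] = {value} if isinstance(value, int) else set(value)
    # Shortest string that can hold every position mentioned.
    length = max((max(t) + 1 for t in tracks.values() if t), default=0)
    return [tuple(int(pos in tracks[var]) for var in order) for pos in range(length)]

def decode_set(word, order, var):
    """Recover the set of positions at which the track of `var` carries a 1."""
    k = order.index(var)
    return {pos for pos, letter in enumerate(word) if letter[k] == 1}

word = encode({"p": 2, "X": {0, 3}}, ["p", "X"])
print(word)
print(decode_set(word, ["p", "X"], "X"))
```

Each letter of the string is a k-tuple of bits, one bit per free variable, so an automaton over this alphabet reads the values of all free variables in lockstep.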



3 Benchmark Formulas 



The experiments presented in the following section are based on twelve bench- 
mark formulas, here shown with their sizes, the logics they are using, and their 
time and space consumptions when processed by Mona 1.4 (on a 296MHz Ul- 
traSPARC with 1GB RAM): 
Benchmark  Name                     Size     Logic               Time        Space
A          dflipflop.mona           2 KB     WS1S (M2L-Str)      0.4 sec     3 MB
B          euclid.mona              6 KB     WS1S (Presburger)   33.1 sec    217 MB
C          fischer_mutex.mona       43 KB    WS1S                15.1 sec    13 MB
D          html3_grammar.mona       39 KB    WS2S (WSRT)         137.1 sec   208 MB
E          lift_controller.mona     36 KB    WS1S                8.0 sec     15 MB
F          mcnc91_bbsse.mona        9 KB     WS1S                13.2 sec    17 MB
G          reverse_linear.mona      11 KB    WS1S (M2L-Str)      3.2 sec     4 MB
H          search_tree.mona         19 KB    WS2S (WSRT)         30.4 sec    5 MB
I          sliding_window.mona      64 KB    WS1S                40.3 sec    59 MB
J          szymanski_acc.mona       144 KB   WS1S                20.6 sec    9 MB
K          von_neumann_adder.mona   5 KB     WS1S                139.9 sec   116 MB
L          xbar_theory.mona         14 KB    WS2S                136.4 sec   518 MB

The benchmarks have been picked from a large variety of Mona applications 
ranging from hardware verification to encoding of natural languages. 

dflipflop.mona - a verification of a D-type flip-flop circuit. Provided by
    Abdel Ayari.

euclid.mona - an encoding in Presburger arithmetic of six steps of
    reachability on a machine that implements Euclid’s GCD algorithm.
    Provided by Tom Shiple.

fischer_mutex.mona and lift_controller.mona - duration calculus encodings of
    Fischer’s mutual exclusion algorithm and a mine pump controller,
    translated to Mona code. Provided by Paritosh Pandya.

html3_grammar.mona - a tree-logic encoding of the HTML 3.0 grammar annotated
    with 10 parse-tree formulas. Provided by Niels Damgaard.

reverse_linear.mona - verifies correctness of a C program reversing a
    pointer-linked list.

search_tree.mona - verifies correctness of a C program deleting a node from a
    search tree.

sliding_window.mona - verifies correctness of a sliding window network
    protocol. Provided by Mark Smith.

szymanski_acc.mona - validation of the parameterized Szymanski problem using
    an accelerated iterative analysis. Provided by Mamoun Filali-Amine.

von_neumann_adder.mona and mcnc91_bbsse.mona - verification of sequential
    hardware circuits; the first verifies that an 8-bit von Neumann adder is
    equivalent to a standard carry-chain adder, the second is a benchmark
    from MCNC91. Provided by Sebastian Mödersheim.

xbar_theory.mona - encodes a part of a theory of natural languages in the
    Chomsky tradition. It was used to verify the theory and led to the
    discovery of mistakes in the original formalization. Provided by Frank
    Morawietz.

We will use these benchmarks to illustrate the effects of the various
implementation “secrets” by comparing the efficiency of Mona shown in the
table above with that obtained by handicapping the Mona implementation by
not using the techniques.

4 Implementation Secrets

The Mona implementation has been developed and tuned over a period of six 
years. Many large and small ideas have contributed to a combined speedup of 
several orders of magnitude. Improvements have taken place at all levels, which 
we illustrate with the following seven examples from different phases of the 
compilation and execution of formulas. 

To enable comparisons, we summarize the effect of each implementation
“secret” by a single dimensionless number for each benchmark formula.
Usually, this is simply the speedup factor, but in some cases where the
numerator is not available, we argue for a more synthetic measure. If a
benchmark cannot run on our machine, it is assigned time ∞.



4.1 Eager Minimization 

When Mona inductively translates formulas to automata, a Myhill-Nerode mini- 
mization is performed after every product and projection operation. Naturally, it 
is preferable to operate with as small automata as possible, but our strategy may 
seem excessive since minimization often exceeds 50% of the total running time. 
This suspicion is strengthened by the fact that Mona automata by construc- 
tion contain only reachable states; thus, minimization only collapses redundant 
states. 
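
To make the collapsing of redundant states concrete, here is a minimal Python sketch of minimization by Moore-style partition refinement on a complete DFA whose states are all reachable. It is an illustration only: the dictionary-based transition table and the names `minimize`, `delta`, and `accepting` are assumptions made for this example, whereas Mona itself operates on BDD-represented transition functions.

```python
def minimize(states, alphabet, delta, accepting):
    """Group the states of a complete DFA into equivalence classes.

    delta maps (state, letter) -> state. Two states end up in the same
    class exactly when no input string distinguishes them.
    """
    # Initial partition: accepting versus non-accepting states.
    block = {q: int(q in accepting) for q in states}
    while True:
        ids = {}
        refined = {}
        for q in states:
            # A state is characterized by its own block and its successors' blocks.
            sig = (block[q],) + tuple(block[delta[q, a]] for a in alphabet)
            refined[q] = ids.setdefault(sig, len(ids))
        if len(ids) == len(set(block.values())):
            break  # no block was split, so the partition is stable
        block = refined
    classes = {}
    for q in states:
        classes.setdefault(block[q], []).append(q)
    return sorted(classes.values())


# Hypothetical example: states 0 and 2 are redundant copies, as are 1 and 3.
delta = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 3, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
    (3, "a"): 1, (3, "b"): 0,
}
classes = minimize([0, 1, 2, 3], "ab", delta, {1, 3})
print(classes)
```

Each class becomes one state of the minimized automaton, so the smaller the classes list, the smaller the operands of the next product or projection.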

Three alternative strategies to the eager one currently used by Mona would 
be to perform only the very final minimization, only the ones occurring after 
projection operations, or only the ones occurring after product operations. Many 
other heuristics could of course also be considered. The following table results 
from such an investigation: 



                                    Time
Benchmark  Only final  After project  After product  Always     Effect
A          ∞           ∞              0.6 sec        0.4 sec    1.5
B          ∞           ∞              ∞              33.1 sec   ∞
C          ∞           ∞              32.3 sec       15.1 sec   2.1
D          ∞           ∞              290.6 sec      137.1 sec  2.1
E          ∞           ∞              19.4 sec       8.0 sec    2.4
F          ∞           ∞              36.7 sec       13.2 sec   2.8
G          ∞           ∞              5.8 sec        3.2 sec    1.8
H          ∞           ∞              59.6 sec       30.4 sec   2.0
I          ∞           ∞              74.4 sec       40.3 sec   1.8
J          ∞           ∞              36.3 sec       20.6 sec   1.8
K          ∞           ∞              142.3 sec      139.9 sec  1.0
L          ∞           ∞              ∞              136.4 sec  ∞

“Only final” is the running time when minimization is only performed as the
final step of the translation; “After project” is the running time when
minimization is also performed after every projection operation; “After
product” is the running time when minimization is instead performed after
every product operation; “Always” is the time when minimization is performed
eagerly; and “Effect” is the “After product” time compared to the “Always”
time (since the other two strategies are clearly hopeless). Eager
minimization is seen to be always beneficial and in some cases essential for
the benchmark formulas.

4.2 Guided Tree Automata 

Tree automata are inherently more computationally expensive because of their
three-dimensional transition tables. We have used a technique of
factorization of state spaces to split big tree automata into smaller ones.
The basic idea, which may result in exponential savings, is explained
elsewhere. To exploit this feature, the Mona programmer must manually specify
a guide, which is a top-down tree automaton that assigns state spaces to the
nodes of a tree. However, when using the WSRT logic, a canonical guide is
automatically generated. For our two WSRT benchmarks, we measure the effect
of this canonical guide:



Benchmark  Without guide  With guide  Effect
D          584.0 sec      137.1 sec   4.3
H          ∞              30.4 sec    ∞



“Without guide” shows the running time without any guide, while “With guide” 
shows the running time with the canonical WSRT guide; “Effect” shows the 
“Without guide” time compared to the “With guide” time. We have only a small 
sample space here, but clearly guides are very useful. This is hardly surprising, 
since they may yield an asymptotic improvement in running time. 

4.3 Cache-Conscious Data Structures 

The data structure used to represent the BDDs for transition functions has
been carefully tuned to minimize the number of cache misses that occur. This
effort is motivated in earlier work, where it is determined that the number
of cache misses during unary and binary BDD apply steps totally dominates the
running time.

In fact, we argued elsewhere that if $A_1$ is the number of unary apply steps
and $A_2$ is the number of binary apply steps, then there exist constants
$m$, $c_1$, and $c_2$ such that the total running time is approximately
$m(c_1 \cdot A_1 + c_2 \cdot A_2)$. Here, $m$ is the machine-dependent delay
incurred by an L2 cache miss, and $c_1$ and $c_2$ are the average numbers of
cache misses for unary and binary apply steps. This estimate is based on the
assumption that the time incurred for manipulating auxiliary data structures,
such as those used for describing subsets in the determinization
construction, is insignificant. For the machine we have used for experiments,
a small C utility determined that $m = 0.43\,\mu\mathrm{s}$. In our BDD
implementation, we have estimated from algorithmic considerations that
$c_1 = 1.7$ and $c_2 = 3$ (the binary apply may entail the use of unary apply
steps for doubling tables that were too small; these steps are not counted
towards the time for binary apply steps, which is why we can use the figure
$c_2 = 3$). We also estimated that for an earlier conventional
implementation, the numbers were $c_1 = 6.7$ and $c_2 = 7.3$. The main reason
for this difference is that our specialized package stores nodes directly
under their hash address to minimize cache misses; traditional BDD packages
store BDD nodes individually, with the hash table containing pointers to
them, roughly doubling the time it takes to process a node. We no longer
support the conventional BDD implementation, so to measure the effect of
cache-consciousness, we must use the above formula to estimate the running
times that would have been obtained today.
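
The cost model above is simple enough to evaluate directly. The following Python sketch applies it to the apply-step counts reported for benchmark A in the table that follows; the function name `predicted_seconds` and the variable names are assumptions introduced for this illustration, while the constants are the ones stated in the text.

```python
M_MICROSECONDS = 0.43  # delay of one L2 cache miss on the machine used in the text


def cache_misses(apply1, apply2, c1, c2):
    """Estimated number of cache misses: c1 * A1 + c2 * A2."""
    return c1 * apply1 + c2 * apply2


def predicted_seconds(apply1, apply2, c1, c2, m_us=M_MICROSECONDS):
    """Estimated running time m * (c1 * A1 + c2 * A2), in seconds."""
    return m_us * 1e-6 * cache_misses(apply1, apply2, c1, c2)


# Unary and binary apply-step counts reported for benchmark A.
a1, a2 = 183_949, 28_253

misses = round(cache_misses(a1, a2, 1.7, 3.0))
cache_conscious = predicted_seconds(a1, a2, 1.7, 3.0)  # tuned BDD package
conventional = predicted_seconds(a1, a2, 6.7, 7.3)     # conventional BDD package

print(misses, round(cache_conscious, 1), round(conventional, 1))
```

Rounded to one decimal, these values agree with the “Misses”, “Predicted”, and “Conventional” entries listed for benchmark A.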

In the following experiment, we have instrumented Mona to obtain the exact 
numbers of apply steps: 



Benchmark  Apply1       Apply2      Misses       Auto       Predicted  Conventional  Effect
A          183,949      28,253      397,472      0.2 sec    0.2 sec    0.6 sec       3.0
B          21,908,722   3,700,856   48,347,395   32.8 sec   20.8 sec   74.7 sec      3.6
C          24,585,292   1,428,381   46,080,139   14.2 sec   19.8 sec   75.2 sec      3.8
E          9,847,007    822,796     19,208,299   7.7 sec    8.2 sec    30.9 sec      3.8
F          13,406,047   5,717,453   39,942,638   12.8 sec   17.2 sec   56.6 sec      3.3
G          233,566      54,814      561,504      0.5 sec    0.3 sec    0.8 sec       2.7
I          36,629,195   11,153,733  95,730,831   37.0 sec   41.2 sec   140.5 sec     3.4
J          10,497,759   2,257,791   24,619,563   11.6 sec   10.6 sec   37.3 sec      3.5
K          129,126,447  10,485,623  250,971,828  137.4 sec  107.9 sec  404.7 sec     3.8


“Apply1” is the number of unary apply steps; “Apply2” is the number of binary
apply steps; “Misses” is the number of cache misses predicted by the formula
above; “Auto” is the part of the actual running time involved in automata
constructions; “Predicted” is the running time predicted from the cache
misses alone; “Conventional” is the predicted running time for a conventional
BDD implementation that was not cache-conscious; and “Effect” is
“Conventional” compared to “Predicted”. In most cases, the actual running
time is close to the predicted one (within 25%). Note that there are
instances where the actual time is about 50% larger than the estimated time:
benchmark B involves a lengthy subset construction on an automaton with small
BDDs, so it violates the assumption that the time spent handling accessory
data structures is insignificant; similarly, benchmark G consists of many
automata with few BDD nodes, which also tends to violate the assumption.

In an independent comparison, it was noted that Mona was consistently twice
as fast as a specially designed automaton package based on a BDD package
considered efficient. Elsewhere, the comparison to a traditional BDD package
yielded a factor 5 speedup.



4.4 BDD-Based Automata Representation 

It is reasonable to ask: “What would happen if we had simply represented the
transition tables in a standard fashion, that is, a row for each state and a
column for each letter?” Under this point of view, it makes sense to define a
letter for each bit-pattern assignment to the free variables of a subformula
(as opposed to the larger set of all variables bound by an outer quantifier).
We have instrumented Mona to measure the sum of the number of entries of all
such automata transition tables constructed during a run of a version of Mona
without BDDs:



Benchmark  Misses       Table entries                  Effect
A          397,472      237,006                        0.6
B          48,347,395   2,973,118                      0.1
C          46,080,139   1,376,499,745,600              29,871.9
E          19,208,299   290,999,305,488                15,149.7
F          39,942,638   2,844,513,432,416,357,974,016  71,214,961,626,128.9
G          561,202      912,194                        1.6
I          95,730,831   116,387,431,997,281,136        1,215,777,934.7
J          24,619,563   15,424,761,908                 626.5
K          250,971,828  2,544,758,557,238,438          10,139,618.4



“Misses” is again the number of cache misses in our BDD-based implementation, 
and “Table entries” is the total number of table entries in the naive implementa- 
tion. To roughly estimate the effect of the BDD-representation, we conservatively 
assume that each table entry results in just a single cache miss; thus, “Effect” 
compares “Table entries” to “Misses”. The few instances where the effect is less 
than one correctly identify benchmark formulas where the BDDs are less nec- 
essary, but are also artifacts of our conservative assumption. Conversely, the 
extremely high effects are associated with formulas that could not possibly be 
decided without BDDs. Of course, the use of BDD-structures completely dom- 
inates all other optimizations, since no implementation could realistically be 
based on the naive table representation. 

The BDD-representation was the first breakthrough of the Mona implemen- 
tation, and the other “secrets” should really be viewed with this as baseline. The 
first implementation did not actually use tables but a conjunctive normal form. 
Nevertheless, the effect of switching to BDDs was stunning. 



4.5 DAGification 

Internally, Mona is divided into a front-end and a back-end. The front-end parses 
the input and builds a data structure representing the automata-theoretic opera- 
tions that will calculate the resulting automaton. The back-end then inductively 
carries out these operations. 

The generated data structure is often seen to contain many common
subformulas. This is particularly true when they are compared relative to
signature equivalence, which holds for two formulas $\phi$ and $\phi'$ if
there is an order-preserving renaming of the variables in $\phi$ (increasing
with respect to the indices of the variables) such that the representations
of $\phi$ and $\phi'$ become identical.

A property of the BDD representation is that the automata corresponding to
signature-equivalent trees are isomorphic in the sense that only the node
indices differ. This means that intermediate results can be reused by simple
exchanges of node indices. For this reason, Mona represents the formulas in a
DAG (Directed Acyclic Graph), not a tree. The DAG is conceptually constructed
from the tree using a bottom-up collapsing process, based on the signature
equivalence relation.
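
Signature equivalence can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not Mona's implementation: formulas are nested tuples, variables are integer indices, only the hypothetical atoms `eq` and `sub` and the connectives `and` and `not` are modeled, and the bookkeeping that records the actual renaming on each DAG edge is omitted. A formula is put into canonical form by replacing each variable with its rank among the variables that occur in it, which is exactly an order-preserving renaming.

```python
ATOMS = ("eq", "sub")  # hypothetical atomic formulas over two variable indices


def variables(f):
    """Set of variable indices occurring in the formula tree f."""
    if f[0] in ATOMS:
        return {f[1], f[2]}
    return set().union(*(variables(g) for g in f[1:]))


def rename(f, mapping):
    """Apply a variable renaming to every atom of f."""
    if f[0] in ATOMS:
        return (f[0], mapping[f[1]], mapping[f[2]])
    return (f[0],) + tuple(rename(g, mapping) for g in f[1:])


def signature(f):
    """Canonical form of f: each variable is replaced by its rank."""
    rank = {v: i for i, v in enumerate(sorted(variables(f)))}
    return rename(f, rank)


def count_nodes(f):
    """Return (tree nodes, DAG nodes), sharing signature-equivalent subformulas."""
    seen = set()

    def walk(g):
        seen.add(signature(g))
        children = () if g[0] in ATOMS else g[1:]
        return 1 + sum(walk(h) for h in children)

    return walk(f), len(seen)


formula = ("and", ("sub", 1, 2),
           ("and", ("sub", 5, 9), ("not", ("sub", 3, 4))))
print(count_nodes(formula))
```

All three `sub` atoms share one signature, so the six tree nodes collapse to four DAG nodes. Swapping the arguments of an atom, as in `("sub", 2, 1)`, yields a different signature because the renaming must preserve the variable order.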

Clearly, constructing the DAG instead of the tree incurs some overhead, but
the following experiments show that the benefits are significantly larger:



                Nodes              Time
Benchmark  Tree     DAG    Tree       DAG        Effect
A          2,532    296    1.7 sec    0.4 sec    4.3
B          873      259    79.2 sec   33.1 sec   2.4
C          5,432    461    40.1 sec   15.1 sec   2.7
D          3,038    270    ∞          137.1 sec  ∞
E          4,560    505    20.5 sec   8.0 sec    2.6
F          1,997    505    49.1 sec   13.2 sec   3.7
G          56,932   1,199  ∞          3.2 sec    ∞
H          8,180    743    ∞          30.4 sec   ∞
I          14,058   1,396  107.1 sec  40.3 sec   2.7
J          278,116  6,314  ∞          20.6 sec   ∞
K          777      273    284.0 sec  139.9 sec  2.0
L          1,504    388    ∞          136.4 sec  ∞



“Nodes” shows the number of nodes in the representation of the formula. “Tree” 
is the number of nodes using an explicit tree representation, while “DAG” is 
the number of nodes after DAGification. “Time” shows the running times for 
the same two cases. “Effect” shows the “Tree” running time compared to the 
“DAG” running time. The DAGification is seen to provide a substantial and 
often essential gain in efficiency. 

The effects reported sometimes benefit from the fact that the restriction 
technique presented in the following subsection knowingly generates redundant 
formulas. This explains some of the failures observed. 



4.6 Three-Valued Logic and Automata

In earlier versions of Mona, we struggled with the issue of encoding
first-order variables as second-order variables. That is the standard
technique for monadic second-order logics, but it raises the issue of
restrictions: the common phenomenon that a formula $\phi$ makes sense,
relative to some exterior conditions, only when an associated restriction
holds. The restriction is also a formula, and the main issue is that $\phi$
is now essentially undefined outside the restriction. Later, when we chose to
base Mona on WS1S instead of a monadic second-order logic on strings, we
sometimes encountered state space explosions when we constrained variables in
order to emulate the string-based semantics.

The nature of these problems is very technical, but fortunately they can be
solved through a theory of restriction couched in a three-valued logic. Under
this view, a restricted subformula $\phi$ is associated with a restriction
$\phi_R$ different from true; an unrestricted formula is associated with a
restriction $\phi_R$ that is true. We do not outline the theory of
restrictions here, except for noting that the restriction of a conjunction
(or disjunction) is the conjunction of the restrictions of the conjuncts (or
disjuncts).

According to this theory, we can guarantee that the WS1S framework handles
all formulas written in the earlier string logic, even with intermediate
automata that are no bigger than when run through the original decision
procedure. Also, the running time of the original procedure may be
asymptotically worse than with the WS1S formulation. Unfortunately, there is
no way of disabling this feature to provide a quantitative comparison.

4.7 Formula Reductions 

Formula reduction is a means of “optimizing” the formulas in the DAG before 
translating them into automata. The reductions are based on a syntactic analysis 
that attempts to identify valid subformulas and equivalences among subformulas. 

There are some non-obvious choices here. How should computation resources 
be apportioned to the reduction phase and to the automata calculation phase? 
Must reductions guarantee that automata calculations become faster? Should the 
two phases interact? Our answers are based on some trial and error along with 
some provisions to cope with subtle interactions with other of our optimization 
secrets. 

Mona 1.4 performs three kinds of formula reductions: 1) simple equality and 
boolean reductions, 2) special quantifier reductions, and 3) special conjunction 
reductions. The first kind can be described by simple rewrite rules (only some 
typical ones are shown): 

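\[
\begin{array}{ll}
X = X \rightsquigarrow \mathit{true} & \phi \wedge \phi \rightsquigarrow \phi \\
\mathit{true} \wedge \phi \rightsquigarrow \phi & \neg\neg\phi \rightsquigarrow \phi \\
\mathit{false} \wedge \phi \rightsquigarrow \mathit{false} & \neg\mathit{false} \rightsquigarrow \mathit{true}
\end{array}
\]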

These rewrite steps are guaranteed to reduce complexity, but will not cause sig- 
nificant improvements in running time, since they all either deal with constant 
size automata or rarely apply in realistic situations. Nevertheless, they are ex- 
tremely cheap, and they may yield small improvements, in particular on machine 
generated Mona code. 
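
As an illustration of how cheap such rules are, the following Python sketch applies the rewrite steps listed above bottom-up to a formula tree. The tuple encoding, the atom name `sub`, and the function name `simplify` are assumptions made for this example, and the mirrored cases (a constant as the right-hand conjunct) are added here for symmetry rather than taken from the text.

```python
TRUE, FALSE = ("true",), ("false",)


def simplify(f):
    """Apply the simple equality and boolean rewrite rules bottom-up."""
    op = f[0]
    if op == "eq":
        return TRUE if f[1] == f[2] else f       # X = X        ~> true
    if op == "not":
        g = simplify(f[1])
        if g == FALSE:
            return TRUE                          # not false    ~> true
        if g[0] == "not":
            return g[1]                          # not not phi  ~> phi
        return ("not", g)
    if op == "and":
        a, b = simplify(f[1]), simplify(f[2])
        if a == FALSE or b == FALSE:
            return FALSE                         # false & phi  ~> false
        if a == TRUE:
            return b                             # true & phi   ~> phi
        if b == TRUE:
            return a                             # mirrored case (assumption)
        if a == b:
            return a                             # phi & phi    ~> phi
        return ("and", a, b)
    return f                                     # other atoms are left alone


example = ("and", ("eq", "X", "X"), ("not", ("not", ("sub", "X", "Y"))))
print(simplify(example))
```

Each node is visited once, so the pass is linear in the size of the formula tree.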

The second kind of reductions can potentially cause tremendous improvements.
The non-elementary complexity of the decision procedure is caused by the
automaton projection operations, which stem from quantifiers. The
accompanying determinization construction may cause an exponential blow-up in
automaton size. Our basic idea is to apply a rewrite step resembling
let-reduction, which removes quantifiers:

\[
\exists X : \phi \;\rightsquigarrow\; \phi[T/X]
\quad \text{provided that } \phi \Rightarrow X = T \text{ is valid, and } T
\text{ is some term satisfying } FV(T) \subseteq FV(\phi)
\]

where $FV(\cdot)$ denotes the set of free variables. For several reasons,
this is not the way to proceed in practice. First of all, finding terms $T$
satisfying the side condition can be an expensive task, in the worst case
non-elementary. Secondly, the translation into automata requires the formulas
to be “flattened” by introduction of quantifiers such that there are no
nested terms. So, if the substitution $\phi[T/X]$ generates nested terms,
then the removed quantifier is recreated by the translation. Thirdly, when
the rewrite rule applies in practice, $\phi$ usually has a particular
structure as reflected in the following more restrictive rewrite rule chosen
in Mona:

\[
\exists X : \phi \;\rightsquigarrow\; \phi[Y/X]
\quad \text{provided that } \phi \equiv \cdots \wedge X = Y \wedge \cdots,
\text{ and } Y \text{ is some variable other than } X
\]

In contrast to equality and boolean reductions, this rule is not guaranteed to 
improve performance, since substitutions may cause the DAG reuse degree to 
decrease. 
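
The restricted quantifier rule can be sketched in a few lines of Python. The sketch assumes a tuple encoding of formulas, a quantifier-free body, and the hypothetical names `conjuncts`, `substitute`, and `reduce_exists`; it is meant to show the shape of the rule, not Mona's actual code.

```python
ATOMS = ("eq", "sub")  # hypothetical atomic formulas over variable names


def conjuncts(f):
    """Flatten nested conjunctions into a list of conjuncts."""
    if f[0] == "and":
        return conjuncts(f[1]) + conjuncts(f[2])
    return [f]


def substitute(f, x, y):
    """Replace variable x by y in f (f is assumed to be quantifier-free)."""
    if f[0] in ATOMS:
        return (f[0],) + tuple(y if v == x else v for v in f[1:])
    return (f[0],) + tuple(substitute(g, x, y) for g in f[1:])


def reduce_exists(f):
    """Rewrite  ex X: phi  to  phi[Y/X]  when X = Y is a conjunct of phi."""
    if f[0] != "ex":
        return f
    x, body = f[1], f[2]
    for c in conjuncts(body):
        if c[0] == "eq" and x in c[1:] and c[1] != c[2]:
            y = c[2] if c[1] == x else c[1]
            return substitute(body, x, y)
    return f  # the side condition does not hold, so the quantifier stays


quantified = ("ex", "X", ("and", ("eq", "X", "Y"), ("sub", "X", "Z")))
print(reduce_exists(quantified))
```

The result still contains the trivial conjunct `("eq", "Y", "Y")`, which a simple equality reduction of the first kind would subsequently rewrite to true.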

The third kind of reductions applies to conjunctions, of which there are two
special sources. One is the formula flattening just mentioned; the other is
the formula restriction technique mentioned in Section 4.6. Both typically
introduce many new conjunctions. Studies of a graphical representation of the
formula DAGs (Mona can create such graphs automatically) led us to believe
that many of these new conjunctions are redundant. A typical rewrite rule
addressing such redundant conjunctions is the following:

\[
\phi_1 \wedge \phi_2 \;\rightsquigarrow\; \phi_1
\quad \text{provided that }
\mathit{unrestr}(\phi_2) \subseteq \mathit{unrestr}(\phi_1) \cup \mathit{restr}(\phi_1)
\text{ and } \mathit{restr}(\phi_2) \subseteq \mathit{unrestr}(\phi_1)
\]

Here, $\mathit{unrestr}(\phi)$ is the set of unrestricted conjuncts in
$\phi$, and $\mathit{restr}(\phi)$ is the set of restricted conjuncts in
$\phi$. This reduction states that it is sufficient to assert $\phi_1$ when
$\phi_1 \wedge \phi_2$ was originally asserted in situations where the
unrestricted conjuncts of $\phi_2$ are already conjuncts of $\phi_1$ (whether
restricted or not) and the restricted conjuncts of $\phi_2$ are unrestricted
conjuncts of $\phi_1$. It is not sufficient that they be restricted conjuncts
of $\phi_1$, since the restrictions may not be the same in $\phi_1$.

With the DAG representation of formulas, the reductions just described can 
be implemented relatively easily in Mona. The table below shows the effects of 
performing the reductions on the benchmark formulas: 



                   Hits                                Time
Benchmark  Simple  Quant.  Conj.  None       Simple     Quant.     Conj.      All        Effect
A          12      8       22     0.8 sec    0.7 sec    0.7 sec    0.7 sec    0.4 sec    2.0
B          10      45      0      58.2 sec   58.8 sec   56.2 sec   56.8 sec   33.1 sec   1.8
C          9       13      8      43.7 sec   41.9 sec   37.1 sec   42.9 sec   15.1 sec   2.9
D          4       28      27     542.7 sec  536.1 sec  296.0 sec  404.7 sec  137.1 sec  4.0
E          5       6       19     22.6 sec   23.4 sec   16.6 sec   22.7 sec   8.0 sec    2.8
F          3       1       1      28.3 sec   29.9 sec   27.0 sec   27.2 sec   13.2 sec   2.1
G          65      318     191    6.1 sec    5.9 sec    6.1 sec    5.9 sec    3.2 sec    1.9
H          35      32      81     104.1 sec  102.6 sec  71.0 sec   98.5 sec   30.4 sec   3.4
I          102     218     7      76.2 sec   76.5 sec   75.0 sec   76.0 sec   40.3 sec   1.9
J          91      0       1      37.3 sec   37.9 sec   37.6 sec   37.0 sec   20.6 sec   1.9
K          9       4       1      313.7 sec  267.9 sec  240.3 sec  302.6 sec  139.9 sec  2.3
L          4       4       18     ∞          ∞          ∞          ∞          136.4 sec  ∞




“Hits” shows the number of times each of the three kinds of reduction is per- 
formed; “Time” shows the total running time in the cases where no reductions 
are performed, only the first kind of reductions are performed, only the second, 
only the third, and all of them together. “Effect” shows the “None” times com- 
pared to the “All” times. All benchmarks gain from formula reductions, and in 
a single example this technique is even necessary. Note that most often all three 
kinds of reductions must act in unison to obtain significant effects. 

A general benefit from formula reductions is that tools generating Mona for- 
mulas from other formalisms may generate naive and voluminous output while 
leaving optimizations to Mona. In particular, tools may use existential quanti- 
fiers to bind terms to fresh variables, knowing that Mona will take care of the 
required optimization. 



5 Future Developments 

Several of the techniques described in the previous section can of course be
further refined. The most promising ideas, however, seem to concentrate on
the BDD representation. In the following, we describe three such ideas.

It is a well-known fact that the ordering of variables in the BDD automata
representation has a strong influence on the number of BDD nodes required.
The impact of choosing a good ordering can be an exponential improvement in
running times. Finding the optimal ordering is an NP-complete problem, but we
plan to experiment with the heuristics that have been suggested.

We have sometimes been asked: “Why don’t you encode the states of the
automata in BDDs, since that is a central technique in model checking?” The
reason is very clear: there is no obvious structure to the state space in
most cases that would lend itself towards an efficient BDD representation.
For example, consider the consequences of a subset construction or a
minimization construction, where similar states are collapsed; in either
case, it is not obvious how to represent the new state. However, the ideas
are worth investigating.

For our tree automata, we have experimentally observed that the use of guides
produces a large number of component automata, many of which are almost
identical. We will study how to compress this representation using a BDD-like
global structure.



6 Conclusion 

The presented techniques reflect a lengthy Darwinian development process of
the Mona tool in which only robust and useful ideas have survived. We have
not mentioned here the many ideas that failed or were surpassed by other
techniques. Our experiences confirm the maxim that optimizations must be
carried out at all levels and that no single silver bullet is sufficient. We
are confident that further improvements are still possible.

Many people have contributed to earlier versions of Mona; in particular, we
are grateful to David Basin, Morten Biehl, Jacob Elgaard, Jesper Gulmann,
Jacob Jensen, Michael Jørgensen, Bob Paige, Theis Rauhe, and Anders Sandholm.

We also thank the Mona users who kindly provided the benchmark formulas.

Abstract. The iterator concept is becoming the fundamental abstraction of
reusable software and the key to modularity and clean code, especially in
object-oriented languages like C++ and Java. Iterators serve as accessors to
a sequence, hiding the implementation details from the algorithm, and their
encapsulation power allows true generic programming. The Standard Template
Library clearly defines their behavior on simple sequences like linked lists
or vectors. In this paper, we define the concept of cursor, which can be seen
as a generalization of the iterator concept to more complex data structures
than sequences, in this case acyclic automata. We show how elegant and
efficient they can be on applications written in C++ and based on the
Automaton Standard Template Library.



1 Introduction 

Cursors introduce a software layer between the deterministic finite automaton
classes of the Automaton Standard Template Library (ASTL, an automaton
library written in C++) and the algorithms. Like the iterator concept, the
cursor concept serves two purposes: making the algorithm independent of the
automaton structure and removing some of the algorithmic responsibilities
from the algorithm core. It must be regarded as a generalization of the
iterator concept: the main difference is that when iterating on data, a
cursor needs to know which path to follow whereas an iterator knows of only
one. Therefore, assigning a traversal algorithm to a cursor makes it an
iterator, and a cursor can be built from an iterator too.

This obviously implies that these objects have a well-defined, consistent and
simple behavior to rely on. Moreover, generic programming standards impose
tough efficiency constraints one has to comply with.

This paper introduces the concept of cursor on acyclic automata. We first present 
a few definitions and programming abstractions and then expose the main ob- 
stacles we met using ASTL that required the introduction of three major models 
of cursor: forward, stack and depth-first cursors. We then discuss the issue of 
applying algorithms and show how easy and straightforward it is to combine 
them to create more powerful algorithms. 

* Supported by Lexiquest SA, http://www.lexiquest.com/ 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 195- E(T71 2001. 

(c) Springer-Verlag Berlin Heidelberg 2001

2 Definitions

2.1 Deterministic Finite Automaton 

To allow more flexibility, we add to the classical DFA definition a set of
tags, that is, any data needed to apply an algorithm. The relation $\tau$
maps each state to a tag.

Let $A(\Sigma, Q, i, F, \Delta, T, \tau)$ be a 7-tuple defined as follows:
$\Sigma$ is the alphabet, $Q$ a set of states, $i \in Q$ the initial state,
$F \subseteq Q$ a set of final (accepting) states,
$\Delta \subseteq Q \times \Sigma \times Q$ a set of transitions, $T$ a set
of tags, $\tau \subseteq Q \times T$ a relation from states to tags.

We distinguish one special state denoted $0$ and called the null or sink
state. For every automaton $A(\Sigma, Q, i, F, \Delta, T, \tau)$ we have
$0 \in Q$. The language of a DFA is written $L(A)$.

We define $P(X)$ as the power-set of a set $X$.



2.2 Access Functions and Sink Transitions 

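To access the transitions of $A$ we define two transition functions
$\delta_1$ and $\delta_2$:

\[
\delta_1 : Q \times \Sigma \to Q, \qquad
\forall q \in Q, \forall a \in \Sigma, \quad
\delta_1(q, a) =
\begin{cases}
p & \text{if } (q, a, p) \in \Delta \\
0 & \text{otherwise}
\end{cases}
\]

$\delta_1$ retrieves a transition target given the source state and a letter.
In the case of undefined transitions the result is the null (sink) state.

A transition verifying the following property is called a sink transition:

\[
(q, a) \in Q \times \Sigma, \quad q = 0 \text{ or } \delta_1(q, a) = 0
\]

$\delta_2$ retrieves the set of all outgoing transitions of a source state,
thus allowing its traversal:

\[
\delta_2 : Q \to P(\Sigma \times Q), \qquad
\forall q \in Q, \quad
\delta_2(q) = \{ (a, p) \in \Sigma \times Q \text{ such that } (q, a, p) \in \Delta \}
\]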
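
The two access functions map naturally onto a small class. The Python sketch below is only an illustration of the definitions above: the class name `DFA`, the method names, and the nested-dictionary storage are assumptions made for this example and do not reflect the ASTL interface, which is written in C++.

```python
SINK = 0  # the null (sink) state


class DFA:
    """Transition relation Delta with the two access functions defined above."""

    def __init__(self, transitions):
        # transitions: iterable of (source, letter, target) triples
        self._out = {}
        for q, a, p in transitions:
            self._out.setdefault(q, {})[a] = p

    def delta1(self, q, a):
        """Target of the transition from q on letter a, or SINK if undefined."""
        return self._out.get(q, {}).get(a, SINK)

    def delta2(self, q):
        """Set of all outgoing (letter, target) pairs of state q."""
        return set(self._out.get(q, {}).items())

    def is_sink_transition(self, q, a):
        """True if q is the sink state or the transition on a leads to it."""
        return q == SINK or self.delta1(q, a) == SINK


dfa = DFA([(1, "a", 2), (2, "b", 3), (2, "c", 1)])
print(dfa.delta1(1, "a"), dfa.delta1(1, "b"), sorted(dfa.delta2(2)))
```

Returning the sink state for undefined transitions keeps `delta1` total, so an algorithm can follow any letter without first testing whether the transition exists.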

2.3 Concepts and Models 

We call a concept a set of requirements on a type covering three aspects:

1. The interface (the method signatures).

2. The semantics of the methods (the behavior).

3. The complexity of the methods (the computing time).

We call a model of a concept a type of object conforming to this concept.
Concepts are usually abstract classes and models are concrete types
implementing the concept. For instance, a C-like pointer is a model of
iterator: it provides an operator [] returning a reference to the i-th
element of a sequence in constant time.

2.4 Object Properties

1. An object has a singular value when it only guarantees assignment operation 
and results of most expressions are undefined. For example, an uninitialized 
pointer has a singular value. 

2. We say that an object x is assignable iff it defines an operator = allowing 
assignment from an object y of the same type. The postcondition of the 
assignment “x = y” is “x is a copy of y” . 

3. An object is default constructible iff no values are needed to initialize it. In 
C++, it defines a default constructor. 

4. An object is equality-comparable iff it provides a way to compare itself
with other objects of the same type. In C++, such an object defines an
operator == returning a boolean value.

5. An object is less-than-comparable iff there exists a partial order
relation on the objects of its type. In C++, such an object provides an
operator < returning a boolean value.

2.5 Iterators 

Quoting from the SGI Standard Template Library reference documentation [SGI99]:

Iterators are a generalization of pointers: they are objects that point 
to other objects. As the name suggests, iterators are often used to iterate 
over a range of objects: if an iterator points to one element in a range, 
then it is possible to increment it so that it points to the next element. 

Iterators are: 

1. default constructible. 

2. assignable (infix operator =). 

3. singular if none of the following properties is true. An iterator with a singular 
value only guarantees assignment operation. 

4. incrementable if applying ++ operator leads to a well-defined position. 

5. dereferenceable if pointed object can be safely retrieved (prefix operator *). 

6. equality-comparable (infix operator ==). 

Iterators constitute the link between the algorithm and the underlying data
structure: they provide a sufficient level of encapsulation to make
processing independent of the data.

2.6 Range 

A valid range [x,y), where x and y are iterators, represents a set of positions
from x to y with the following properties:

1. [x,y) refers to all positions between x and y but not including y which is 
called the end-of-range iterator (also denoted as a past-the-end iterator). 

2. All iterators in the range except y are incrementable and dereferenceable. 

3. All iterators including y are equality comparable. 

4. A finite sequence of incrementations of x leads to position y. 
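The four properties can be seen at work in a minimal sketch of a generic traversal. The function name range_length is hypothetical; it uses only the operations that the range definition guarantees.

```cpp
#include <cstddef>

// Counts the positions of a valid range [x, y).
// Only operations guaranteed by the range properties are used:
// every iterator before y is incremented, y itself is only compared.
template <class Iterator>
std::size_t range_length(Iterator x, Iterator y) {
    std::size_t n = 0;
    for (; !(x == y); ++x)   // terminates because y is reachable from x
        ++n;
    return n;
}
```

With plain pointers as iterators, `range_length(data, data + 4)` visits four positions, and an empty range `[data, data)` visits none.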
3 Weaknesses of the Two-Layer Model



ASTL was structured around a "two-layer" model with algorithms lying directly
on the automaton data structure (Fig. 1).




Fig. 1. Former ASTL structure 



Unfortunately, this limits the range of application of the algorithms and causes
shortcomings in four main respects:

1. The coupling between the algorithms and the automaton classes is much too
strong. It reduces genericity by requiring strictly complying input data: for
more exotic data, either a new version of the algorithm has to be written or an
entire new automaton class has to be designed (which is not always possible,
for instance because of combinatorial problems). This yields multiple instances
of the same code, which contradicts the purpose of generic programming.
Moreover, an intermediary layer allows for data hiding and for applying
algorithms to structures other than automata.

2. Common parts shared by many algorithms should be written only once and
be reused as is. This means moving intensively used functionalities to a third
party other than the algorithm core or the automaton class. The best example is
the iteration over the transitions of a DFA: every algorithm has a traversal
policy, and these policies can be reduced to a few: depth-first, breadth-first and
a couple of others. A cursor is the place to implement these common parts.

3. The algorithm must not impose a behavior that should be decided externally:
for example, writing a DFA union algorithm should not involve decisions
about applying it on-the-fly, by lazy construction or by copy. The decision
should be taken at utilization time and not at design time (see how to apply
an algorithm in Sect. 7).

4. Reusing and combining algorithms should be a straightforward operation
requiring no extra code. Hard-coded algorithms prevent such flexibility but
cursor adapters allow it.
Fig. 2. Current ASTL structure



4 Forward Cursor 

Using a cursor is a way of entirely hiding the processed data structure, which
no longer needs to be an automaton (Fig. 2). Moreover, it allows
functionalities to be moved from the algorithm to the cursor layer, leading to
clearer and lighter algorithms (see the language algorithm in Sect. 6.5).

4.1 Definition 

A forward cursor is basically an object pointing to a DFA transition
$(q, a, p) \in Q \times \Sigma \times Q$ (possibly a sink transition) allowing access
and forward moves along it. Like iterators, cursors represent positions in the
automaton and therefore can be used to define ranges over it.

4.2 Properties 

1. A forward cursor is default constructible but has by default a singular value. 

2. A forward cursor is assignable. 

3. A forward cursor is equality-comparable. 

4. A forward cursor is dereferenceable (one can access the source, target and
letter of the pointed transition) iff it does not point to a sink transition.

5. A forward cursor is "incrementable" (it may move along the currently
pointed transition) iff it does not point to a sink transition.

4.3 Interface 

In the following table, we will write: 

X for a type which is a model of forward cursor 

x, y for objects of type X

a for a letter (an unsigned integral type)

$(q, a, p) \in Q \times \Sigma \times Q$ for the transition x is pointing to






V. Le Maout 



Name                 Expression              Semantics
-------------------  ----------------------  ------------------------------------------
source state         x.src();                return q
aim state            x.aim();                return p
letter               x.letter();             return a
final source         x.src_final();          return true if $q \in F$
final aim            x.aim_final();          return true if $p \in F$
comparison           (x == y)                true if x points to the same transition as y
forward              x.forward();            move along the currently pointed transition
forward with letter  x.forward(a);           move along the transition labeled with a; return true if $\delta_1(q, a) \neq \emptyset$
first transition     x.first_transition();   set x on the first transition of $\delta_2(q)$; return true if $\delta_2(q) \neq \emptyset$
next transition      x.next_transition();    move on to the next transition of the set $\delta_2(q)$; return false if the reached transition $(q, a', p')$ is a sink transition
find                 x.find(a);              set x on the transition labeled with letter a; return true if $(q, a, \delta_1(q, a))$ is not a sink transition

4.4 Using Cursors and Their Adaptability Power 

Here is an example of an algorithm testing whether a word defined by a range
[first, last) is in the language recognized by a DFA accessed through a cursor
c:

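    template <class Iterator, class ForwardCursor>
    bool is_in(Iterator first, Iterator last, ForwardCursor c) {
        // follow the word letter by letter until it ends or a sink is hit
        while (first != last && c.forward(*first))
            ++first;
        // the word is accepted iff it was read entirely and ends in a final state
        return first == last && c.src_final();
    }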
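To make the algorithm concrete, the following sketch pairs it with a toy cursor. ToyDfa, ToyCursor and make_example_dfa are hypothetical names introduced only for this illustration; they are not the ASTL classes, and the cursor implements just the two operations that is_in requires.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical minimal DFA (not the ASTL class): states are ints.
struct ToyDfa {
    int initial = 0;
    std::set<int> finals;
    std::map<std::pair<int, char>, int> delta;   // (source, letter) -> aim
};

// Hypothetical cursor offering only forward(a) and src_final().
class ToyCursor {
public:
    explicit ToyCursor(const ToyDfa& dfa) : dfa_(&dfa), q_(dfa.initial) {}
    // Move along the transition labeled with a; false on a sink transition.
    bool forward(char a) {
        auto t = dfa_->delta.find(std::make_pair(q_, a));
        if (t == dfa_->delta.end()) return false;
        q_ = t->second;
        return true;
    }
    bool src_final() const { return dfa_->finals.count(q_) != 0; }
private:
    const ToyDfa* dfa_;
    int q_;   // current source state
};

template <class Iterator, class ForwardCursor>
bool is_in(Iterator first, Iterator last, ForwardCursor c) {
    while (first != last && c.forward(*first))
        ++first;
    return first == last && c.src_final();
}

// Builds a DFA recognizing exactly the words "ab" and "abc".
inline ToyDfa make_example_dfa() {
    ToyDfa d;
    d.delta[std::make_pair(0, 'a')] = 1;
    d.delta[std::make_pair(1, 'b')] = 2;
    d.delta[std::make_pair(2, 'c')] = 3;
    d.finals.insert(2);
    d.finals.insert(3);
    return d;
}
```

The algorithm never sees ToyDfa: any other type offering the same two member functions could be passed in its place.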

Any complying cursor can be passed to this algorithm: the default forward cursor
or any cursor adapter implementing a different behavior and extra
functionalities. Here are a few examples:

- six set operations: union, intersection, difference, symmetrical
difference, negation and concatenation.

- a hash cursor, computing a hash value for the recognized words along its
path. This algorithm realizes a bijective mapping between the recognized
words of the DFA and integers, providing a fast perfect hash function.

- default transition cursors using a variety of failure forward functions, making
for instance the automaton complete in a trivial way: if the requested transition
is a sink transition then the cursor moves along the default transition. Such
cursors are used in the implementation of the Aho-Corasick pattern matching
algorithm.
- string and C-string cursors giving a simple word or sequence a cursor
interface and therefore the appearance of a flat automaton.

- a permutation cursor implementing in a very simple way a virtual automaton
recognizing all permutations of the word 1 2 3 ... n.

- a scoring cursor computing an integer value for a matched pattern as the sum
of all the scores of the matched sub-expressions. This was used in a search
engine to evaluate the amount of confidence for each piece of matched text.

In the following, we take a closer look at two of these adapters,
the set operations and the permutation cursor, which highlight the encapsulation
power of cursors.

Set Operations. Union, intersection, difference, symmetrical difference and
concatenation cursors are binary forward cursor adapters. They make use of
two underlying forward cursors and provide the same complying interface. They
perform all the needed algorithmic operations on-the-fly and can be used
immediately. The following piece of code checks whether the word word is in the
language recognized by the intersection of two DFAs A1 and A2:

    char word[] = "word to check";
    DFA A1, A2;
    forward_cursor<DFA> c1(A1.initial()), c2(A2.initial());
    if (is_in(word, word + 13, intersection_cursor(c1, c2)))
        cout << "ok";
    else
        cout << "not found";

The negation cursor is a unary forward cursor adapter on a DFA A allowing
access to a virtual DFA whose language is $\Sigma^* \setminus L(A)$.
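A binary adapter of this kind can be sketched as follows. This is only an illustration under stated assumptions: ToyDfa, ToyCursor and IntersectionCursor are hypothetical stand-ins, not the ASTL classes, and only forward(a) and src_final() are adapted.

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>

// Hypothetical minimal DFA and cursor (not the ASTL classes).
struct ToyDfa {
    int initial = 0;
    std::set<int> finals;
    std::map<std::pair<int, char>, int> delta;   // (source, letter) -> aim
};

class ToyCursor {
public:
    explicit ToyCursor(const ToyDfa& dfa) : dfa_(&dfa), q_(dfa.initial) {}
    bool forward(char a) {
        auto t = dfa_->delta.find(std::make_pair(q_, a));
        if (t == dfa_->delta.end()) return false;
        q_ = t->second;
        return true;
    }
    bool src_final() const { return dfa_->finals.count(q_) != 0; }
private:
    const ToyDfa* dfa_;
    int q_;
};

// Binary adapter: both underlying cursors move in lockstep, so the
// product automaton is explored on-the-fly and never built.
template <class C1, class C2>
class IntersectionCursor {
public:
    IntersectionCursor(const C1& c1, const C2& c2) : c1_(c1), c2_(c2) {}
    bool forward(char a) {
        C1 n1 = c1_;
        C2 n2 = c2_;
        // A sink in either automaton is a sink of the intersection;
        // the adapter is then left unchanged.
        if (!n1.forward(a) || !n2.forward(a)) return false;
        c1_ = n1;
        c2_ = n2;
        return true;
    }
    bool src_final() const { return c1_.src_final() && c2_.src_final(); }
private:
    C1 c1_;
    C2 c2_;
};

template <class C1, class C2>
IntersectionCursor<C1, C2> intersection_cursor(const C1& c1, const C2& c2) {
    return IntersectionCursor<C1, C2>(c1, c2);
}

template <class Iterator, class ForwardCursor>
bool is_in(Iterator first, Iterator last, ForwardCursor c) {
    while (first != last && c.forward(*first))
        ++first;
    return first == last && c.src_final();
}
```

Because the adapter exposes the same interface as the cursors it wraps, adapters can themselves be nested inside other adapters.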

Permutations Automaton. This example shows a convenient way to hide
the data structure and to overcome memory space limitations and combinatorial
explosion.

The regular expression matcher we developed had to look for n patterns
combined in any possible order. The matcher engine had been previously written
to search a range [first, last) of characters with a cursor c on the regular
expression DFA:

    template <class Iterator, class ForwardCursor>
    bool match(Iterator first, Iterator last, ForwardCursor c) {
        // succeed as soon as a final state is reached
        for (; first != last && c.forward(*first); ++first)
            if (c.src_final()) return true;
        return false;
    }

By simply changing the default cursor to a multiple cursor encapsulating n
forward cursors, we can look simultaneously for n patterns with the same
algorithm. Assigning a unique integer $i \in [1, n]$ to each pattern, the problem
comes down to matching a word in the DFA recognizing all permutations of the
sequence 1, 2, 3, ..., n.

Our first approach was to precompute the automaton for all permutations,
but this quickly turns out to consume too much memory space when n reaches
9. The second step was to minimize the automaton, which is very efficient: by
using a compact data structure, we could shrink the automaton for n = 8 to
about 20 KB, but the memory usage peak due to minimization and the rather
long processing time remained major drawbacks. Eventually, we created a cursor
moving on a virtual automaton simply by keeping track of the letters of visited
transitions in a bit vector. A vector full of 1s means that an accepting state has
been reached.
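The bit-vector idea can be sketched as follows. PermutationCursor is a hypothetical reconstruction written for this illustration, not the ASTL code; it assumes letters are the integers 1 to n with n at most 32, and implements only the operations needed by is_in.

```cpp
#include <cstdint>

// Hypothetical sketch: a cursor on the virtual automaton recognizing
// all permutations of 1, 2, ..., n (assumption: 1 <= n <= 32).
class PermutationCursor {
public:
    explicit PermutationCursor(unsigned n) : n_(n), visited_(0) {}
    // Letter a can be read iff it is in [1, n] and has not been read yet.
    bool forward(unsigned a) {
        if (a < 1 || a > n_) return false;
        const std::uint32_t bit = std::uint32_t(1) << (a - 1);
        if (visited_ & bit) return false;   // sink: a was already used
        visited_ |= bit;                    // remember the visited letter
        return true;
    }
    // A vector full of 1s means an accepting state has been reached.
    bool src_final() const {
        const std::uint32_t full =
            (n_ == 32) ? 0xFFFFFFFFu : ((std::uint32_t(1) << n_) - 1);
        return visited_ == full;
    }
private:
    unsigned n_;
    std::uint32_t visited_;   // bit i-1 is set iff letter i has been read
};

template <class Iterator, class ForwardCursor>
bool is_in(Iterator first, Iterator last, ForwardCursor c) {
    while (first != last && c.forward(*first))
        ++first;
    return first == last && c.src_final();
}
```

The cursor occupies a few bytes whatever the value of n, whereas the explicit automaton has one state per subset of letters.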

5 Stack Cursor 

A stack cursor is in some sense a bidirectional cursor. By keeping track of its 
previous positions during iteration, it is able to move back. 

5.1 Definition 

A stack cursor is a forward cursor using a stack to store its path along the
automaton, thus allowing backward iteration. It consists of a stack of forward
cursors, and all operations apply to the stack top. Its behavior relies on the
underlying forward cursor.

5.2 Interface

The interface of the stack cursor is very close to the forward cursor interface: the
forward action pushes the resulting cursor onto the top and the backward
action pops it.

Note that the comparison operator compares not only tops but entire
stacks, as the underlying concept is a path and not a transition. The reason will be
made clear by the depth-first cursor abstraction in Sect. 6.
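The push and pop behavior can be sketched as an adapter over any forward cursor. The classes below are hypothetical illustrations, not the ASTL implementation; only forward(a), backward(), src_final() and the comparison are shown, and backward() assumes a nonempty stack.

```cpp
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical minimal DFA and forward cursor (not the ASTL classes).
struct ToyDfa {
    int initial = 0;
    std::set<int> finals;
    std::map<std::pair<int, char>, int> delta;   // (source, letter) -> aim
};

class ToyCursor {
public:
    explicit ToyCursor(const ToyDfa& dfa) : dfa_(&dfa), q_(dfa.initial) {}
    bool forward(char a) {
        auto t = dfa_->delta.find(std::make_pair(q_, a));
        if (t == dfa_->delta.end()) return false;
        q_ = t->second;
        return true;
    }
    bool src_final() const { return dfa_->finals.count(q_) != 0; }
    friend bool operator==(const ToyCursor& x, const ToyCursor& y) {
        return x.dfa_ == y.dfa_ && x.q_ == y.q_;
    }
private:
    const ToyDfa* dfa_;
    int q_;
};

// Stack cursor sketch: a stack of forward cursors recording the path.
template <class ForwardCursor>
class StackCursor {
public:
    explicit StackCursor(const ForwardCursor& start) : stack_(1, start) {}
    // Move a copy of the top forward and push it if the move succeeded.
    bool forward(char a) {
        ForwardCursor next = stack_.back();
        if (!next.forward(a)) return false;
        stack_.push_back(next);
        return true;
    }
    // Pop; return false if the resulting stack is empty.
    // Precondition (assumption of this sketch): the stack is not empty.
    bool backward() {
        stack_.pop_back();
        return !stack_.empty();
    }
    bool src_final() const { return stack_.back().src_final(); }
    // Entire stacks are compared: the underlying concept is a path.
    friend bool operator==(const StackCursor& x, const StackCursor& y) {
        return x.stack_ == y.stack_;
    }
private:
    std::vector<ForwardCursor> stack_;
};
```

Two stack cursors that reached the same state along different paths compare unequal, which is what a depth-first traversal needs in order to tell positions apart.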

A stack cursor must implement the forward cursor requirements plus:

Name                 Expression      Semantics
-------------------  --------------  ------------------------------------------
forward              x.forward();    move along the current stack top transition and push the reached transition $(p, a', p')$
forward with letter  x.forward(a);   move along the transition labeled with a: push $(q, a, \delta_1(q, a))$ and return true if $\delta_1(q, a) \neq \emptyset$
backward             x.backward();   pop; return false if the resulting stack is empty
comparison           (x == y)        return true if the stack of x is a copy of the stack of y

From the default stack cursor implementation we designed an adapter called
the neighbor cursor.

5.3 Application: The Neighbor Cursor

This is the central issue in the design of a spelling corrector: given a word w and
an editing distance d, find all words w' recognized by an automaton A for which
the editing operations (successive letter insertions, deletions and
substitutions) needed to reach w' from w do not exceed a score of d. Each
operation type is assigned a relative score (a weight) and the final distance is the
sum of the scores.

The neighbor cursor adapts the default stack cursor in two ways:

1. It has the responsibility to hide paths straying too far away from w.

2. It has to adapt the stacking policy to manage the deletion operation: when
moving forward, it has to move on to the next letter of w but has to stay on
the current transition. That means pushing a copy of the cursor stack top,
leaving it unchanged.



6 Depth-First Iteration Cursor 

6.1 Definition 



A depth-first cursor, as the name suggests, accesses the transitions of a DFA in
depth-first order. It relies on the stack cursor implementation but belongs to a
different concept: a depth-first cursor represents an algorithm stage rather than
a mere position in the automaton.



6.2 Properties 

1. A depth-first cursor iterates on transitions in a depth-first order. 

2. A depth-first cursor never pushes sink transitions (the underlying stack holds 
only dereferenceable cursors). When reaching a sink transition, the action 
taken is conceptually equivalent to a pop. 

3. A depth-first cursor represents an algorithm stage and consequently can be
used to define the range, passed to a function, over which an algorithm applies.

4. It is default constructible, assignable, equality-comparable, dereferenceable, 
incrementable and by default represents the empty stack. 








6.3 Interface 



Name          Expression      Semantics
------------  --------------  ------------------------------------------
source        x.src();        return q
aim           x.aim();        return p
letter        x.letter();     return a
final source  x.src_final();  return true if $q \in F$
final aim     x.aim_final();  return true if $p \in F$
forward       x.forward();    move on to the next transition in depth-first order; return true if x actually moved forward or false if x popped
comparison    (x == y)        compare entire stacks

6.4 Algorithm Depth-First Iteration 

To grant more freedom to algorithm users, we use depth-first cursors to
define the range over which an algorithm applies. Starting from a position in the
automaton given by a cursor set to a specified transition, an algorithm applies
until a stop condition becomes true, most of the time until the stack is empty.
This stop condition can be represented by a cursor too: remember that depth-first
cursors are algorithm stages. Consequently, initializing a depth-first cursor with
the empty stack makes it a valid end-of-range position, and the algorithm stops
when the moving cursor reaches the empty stack state.

6.5 Application: The Language Algorithm 

    void language(DepthFirstCursor first, DepthFirstCursor last) {
        vector<char> word;
        while (first != last) {                     // until end of range
            word.push_back(first.letter());         // push current letter
            if (first.aim_final()) output(word);    // if target is final, output the word
            while (!first.forward())                // while moving backward
                word.pop_back();                    // pop the last word letter
        }
    }

Example: display the language of the automaton A (Fig. 3)

    DFA A;
    // initialize start of range with state 1:
    forward_cursor<DFA> begin(A.initial());
    // set the cursor on transition (1,a,2):
    begin.first_transition();
    // use empty stack (default value) for end of range:
    language(depth_first_cursor(begin), depth_first_cursor());
Fig. 3. Automaton A



Restricted range: display the language of the sub-automaton A'

    // initialize start of range with transition (1,a,2):
    forward_cursor<DFA> begin(A.initial()), end(A.initial());
    begin.first_transition();
    // initialize end of range with transition (1,b,6):
    end.find('b');
    language(depth_first_cursor(begin), depth_first_cursor(end));
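For readers who want to experiment, here is one possible self-contained realization of the idea. It is a sketch under stated assumptions and not the ASTL implementation: ToyDfa and this DepthFirstCursor are hypothetical, the automaton must be acyclic, the cursor keeps a small internal state so that moving to a sibling transition is reported as a pop followed by a forward move, and forward() returns true on the empty stack so that the inner loop of language terminates. The output parameter is also an addition of this sketch.

```cpp
#include <limits>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical acyclic DFA; transitions are sorted by (source, letter).
struct ToyDfa {
    typedef std::map<std::pair<int, char>, int> Delta;
    int initial = 0;
    std::set<int> finals;
    Delta delta;
};

class DepthFirstCursor {
    typedef ToyDfa::Delta::const_iterator Tr;   // one transition (q, a, p)
    enum State { FRESH, PENDING, EXHAUSTED };
public:
    DepthFirstCursor() : dfa_(nullptr), state_(FRESH) {}   // empty stack
    DepthFirstCursor(const ToyDfa& dfa, int q) : dfa_(&dfa), state_(FRESH) {
        Tr t = first_from(q);
        if (t != dfa.delta.end()) stack_.push_back(t);
    }
    char letter() const { return stack_.back()->first.second; }
    bool aim_final() const {
        return dfa_->finals.count(stack_.back()->second) != 0;
    }
    // Returns true if the cursor moved forward, false if it popped.
    bool forward() {
        if (stack_.empty()) return true;          // nothing left to pop
        if (state_ == PENDING) { state_ = FRESH; return true; }
        if (state_ == FRESH) {                    // try to go deeper
            Tr child = first_from(stack_.back()->second);
            if (child != dfa_->delta.end()) {
                stack_.push_back(child);
                return true;
            }
        }
        Tr sibling = stack_.back();               // top is finished
        ++sibling;
        if (sibling != dfa_->delta.end() &&
            sibling->first.first == stack_.back()->first.first) {
            stack_.back() = sibling;              // reported on next call
            state_ = PENDING;
        } else {
            stack_.pop_back();
            state_ = EXHAUSTED;
        }
        return false;
    }
    friend bool operator==(const DepthFirstCursor& x, const DepthFirstCursor& y) {
        return x.stack_ == y.stack_;              // compare entire stacks
    }
    friend bool operator!=(const DepthFirstCursor& x, const DepthFirstCursor& y) {
        return !(x == y);
    }
private:
    Tr first_from(int q) const {
        Tr t = dfa_->delta.lower_bound(
            std::make_pair(q, std::numeric_limits<char>::min()));
        return (t != dfa_->delta.end() && t->first.first == q)
                   ? t : dfa_->delta.end();
    }
    const ToyDfa* dfa_;
    State state_;
    std::vector<Tr> stack_;
};

template <class Cursor, class Output>
void language(Cursor first, Cursor last, Output output) {
    std::string word;
    while (first != last) {
        word.push_back(first.letter());
        if (first.aim_final()) output(word);
        while (!first.forward())
            word.pop_back();
    }
}
```

Calling language with a cursor built on the initial state and a default-constructed cursor as end of range enumerates every recognized word in lexicographic order of the transitions.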

7 Applying Algorithms 

The main benefit of cursors is that they allow one to decide at the last minute,
rather than at design time, how to apply an algorithm. Most of the time, one of
the following policies is better suited to one's needs, but the algorithm cannot be
aware of it. Instead of requiring three versions of the code, cursors allow the
decision to be made externally.

7.1 On-the-Fly Processing 

All cursors perform on-the-fly data processing. This has many advantages: 

1. By writing one version of an algorithm you get the other two implementations
for free (lazy and copy).

2. Sometimes combinatorial problems get in the way and there is no means to
build the desired automaton, but a cursor adapter is able to simulate a DFA
(see Sect. 4.4 about permutations).

3. Sometimes an operation is needed only once and there is no need to build an
entire automaton, as in Sect. 4.4 for intersection.

Consequently, algorithms impose constraints only on the public interface, and
any extra internal processing is left up to the user.
7.2 Lazy Construction

When the ratio of preprocessing time to computing time becomes too large,
it is necessary to delay the actual construction or, more precisely, to build the
resulting automaton incrementally. A famous example is the incremental
construction of a DFA from a regular expression during the scanning of the text.
It is also particularly interesting when determinizing an NFA.

In this case, construction is made internally by a cursor adapter called the lazy 
cursor completely encapsulating the extra processing. This adapter is passed to 
the algorithm where a default cursor would be passed in an on-the-fly operation. 
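As an illustration of lazy construction, the sketch below determinizes an NFA incrementally through the standard subset construction, creating a DFA state only when the cursor first reaches it. ToyNfa and LazyCursor are hypothetical names for this example and do not reflect the ASTL lazy cursor; the NFA is assumed to have no epsilon-transitions, and the cache lives inside the cursor object, so copies of the cursor do not share it.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Hypothetical NFA without epsilon-transitions.
struct ToyNfa {
    int initial = 0;
    std::set<int> finals;
    std::map<std::pair<int, char>, std::set<int> > delta;
};

// Lazy subset construction hidden behind forward(a) and src_final().
class LazyCursor {
public:
    explicit LazyCursor(const ToyNfa& nfa) : nfa_(&nfa), current_(0) {
        current_ = intern(std::set<int>{nfa.initial});
    }
    bool forward(char a) {
        const std::pair<int, char> key(current_, a);
        auto t = built_.find(key);
        if (t == built_.end()) {                 // not built yet: build now
            std::set<int> target;
            for (int q : subsets_[current_]) {
                auto d = nfa_->delta.find(std::make_pair(q, a));
                if (d != nfa_->delta.end())
                    target.insert(d->second.begin(), d->second.end());
            }
            if (target.empty()) return false;    // sink transition
            t = built_.insert(std::make_pair(key, intern(target))).first;
        }
        current_ = t->second;
        return true;
    }
    bool src_final() const {
        for (int q : subsets_[current_])
            if (nfa_->finals.count(q)) return true;
        return false;
    }
    // Number of DFA states constructed so far.
    std::size_t built_states() const { return subsets_.size(); }
private:
    // Returns the DFA state number of subset s, creating it if needed.
    int intern(const std::set<int>& s) {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        const int id = static_cast<int>(subsets_.size());
        subsets_.push_back(s);
        ids_[s] = id;
        return id;
    }
    const ToyNfa* nfa_;
    int current_;                                    // current DFA state
    std::vector<std::set<int> > subsets_;            // DFA state -> NFA states
    std::map<std::set<int>, int> ids_;               // NFA states -> DFA state
    std::map<std::pair<int, char>, int> built_;      // built DFA transitions
};
```

Only the subsets actually visited while reading the input are ever created, which is the point of delaying the construction.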

7.3 Building by Copy 

Whenever the actual construction of the resulting DFA is needed, one can either
use the clone algorithm, which makes an exact copy of a cursor range into a new
DFA, or the ccopy algorithm (cursor copy), which duplicates the input DFA
through a cursor in the ascending stage of the depth-first iteration, trimming
unneeded paths leading to non-accepting states.

7.4 Algorithm Combining 

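Cursors offer a simple and straightforward way to extend algorithm power.
For example, one can retrieve all the words lying within an editing distance of 5
from a specified word in the difference between the language of one DFA and the
concatenation of the languages of two other DFAs:

    language(depth_first_c(                  // start of range
                 neighbor_c(
                     diff_c(forward_c(dfa1),
                            concat_c(forward_c(dfa2), forward_c(dfa3))),
                     "word", 5)),
             depth_first_c());               // default end of range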
By directly calling the language algorithm described in Sect. 6.5 we process the
data on-the-fly, but we can as well build the resulting automaton A by using the
ccopy algorithm:

    ccopy(A, depth_first_c(                  // start of range
                 neighbor_c(
                     diff_c(forward_c(dfa1),
                            concat_c(forward_c(dfa2), forward_c(dfa3))),
                     "word", 5)),
          depth_first_c());                  // default end of range

8 Conclusion 

Speed tests have been conducted to compare cursors with classical recursive
implementations, and it turned out that no time was wasted, because cursors are
simple objects with very basic abilities: their implementation does not require a
huge and complicated piece of code and most of the time a simple, optimal and
straightforward solution does the trick.

An extended version of this document provides a rigorous, detailed description
of cursor behavior, including time complexities and C++ signatures of methods.
Cursors have been successfully implemented in a search engine called word
grep. We have overcome most of the problems encountered in the first stage of
ASTL development, and encouraging results are leading us to consider extending
them to cyclic automata and transducers.
Abstract. We present a simple theoretical model of web navigation, in
which each WWW user creates style specifications which constrain web 
browsing, search and navigation using the user’s own judgement of the 
quality of visited sites. A finite state automaton is associated with each 
specification, which is presented in a two-level modal logic making up the 
acceptance condition for the automaton. We show that many interesting 
queries regarding the user’s web search can be answered using standard 
automata theory. 



1 Motivation 

A classical application of automata theory is searching a given text for occur- 
rence of patterns. Finite state automata are employed in lexical analysers and 
document processors for pattern search. Recently, the world of documents has 
grown very fast, thanks to the World Wide Web (WWW), where we do not 
speak of text, but hypertext. Here documents contain pointers or links to doc- 
uments elsewhere. However, access to documents is not only via these links but 
also using search engines which, given a keyword, locate a document containing 
the keyword. In such a situation, a natural application of automata theory seems 
to be that of web navigation. 

On the other hand, we could argue that the distributed nature of information 
on the net is irrelevant and we can consider the entire web as one single huge 
piece of text. Search engines like Alta Vista or Google in effect do that, act as 
finite state automata that access the entire web when they search for keywords. 
In this manner, one can argue that there is no need for any new automata theory 
for searching the web, but only cleverer methods for data representation. 

While there is some measure of truth in this argument, we believe that there 
is need for automata theory in this context from a slightly different viewpoint, 
that of quality controlled navigation. Quality remains the biggest challenge of 
search engines [Pen99], and there is a perceived urgent need for ways by which
web users can direct and control navigation based on their own judgement of 
quality of web sites. 

In this paper, we present a simple (naive) theoretical model of user controlled 
navigation on the web. In this model, each user prepares a style sheet in which she 
specifies her judgement of sites based on the information they contain, and sets 
down constraints that every navigation must satisfy. The style sheet induces a
finite state automaton that can be thought of as her own navigation-cum-search 
engine, which uses some general purpose engine like Alta Vista to access the 
web. These automata can be seen as customized front-ends to standard search 
engines. Formally, they are finite state transition systems, and their behaviour is 
given by partitioning the set of possible runs (or trajectories, as they are called 
here) into good and bad runs based on the constraints specified in the style 
sheet. 

We then have a situation where the input to an automaton is given as a finite 
graph whose nodes carry input information, and the finite state control dictates 
which nodes to visit, in what order. It defines a process which originates from a 
home location, migrates to various nodes on the web picking up information on 
the way and returns home eventually. 

The main contribution of the paper is a framework for the theoretical study of 
quality controlled web navigation. The framework is built on standard concepts 
and hence offers hope for implementation. It must be emphasized that the paper 
presents only the initial steps of a theoretical framework, and it is hoped that 
more realistic models will follow. However, design features of such models (like 
whether links should be static or dynamic, bookmarks implicit or explicit, search 
distributed or not, whether transitive closure of paths on the web is needed) 
should be decided by empirical compulsions rather than theoretical elegance. 
Corpus studies like [VTM98] should dictate how this theory should proceed.

Related work: User-controlled navigation has been studied by several
researchers from the WWW community, under the name of spiders, crawlers,
robots, knowbots and so on. Miller and Bharat give a kind of style
specification using C++. Mendelzon, Mihaila and Milo [MMM96] modify the query
language SQL to operate on the web. Pazzani, Muramatsu and Billsus [PMB96]
describe a system which piggybacks onto a Lycos browser.

However, the use of automata theory for modelling these systems appears 
to be new. Automata working on graph structures have been studied for long: 
the important topic of visiting all sites in a labyrinth is surveyed in lEDM. 
Thomas has recently emphasized the importance of studying automata
on partially ordered structures. All these papers concentrate on the class of
graphs accepted/traversed by such automaton models; but, as we will see, the 
approach followed here is distinct. Even if we were to consider the Web as graph 
input to the automaton, due to the keyword search facility, the automaton is 
not obliged to follow the paths in the graph. 



2 The Model 

In this paper, automata behave in a somewhat non-standard fashion: there is 
one single finite graph processed by all automata. The behaviour of an automaton 
is some set of trajectories in this graph (as opposed to a set of graphs). Each 
node of the graph is supposed to contain some information, represented by a 
finite set of letters from a fixed alphabet. The finite state control moves on the 
graph not linearly, but according to a specified constraint, either using an edge or
using a search. In general the automaton does not process the entire input graph,
but its behaviour is given by trajectories that take it from a designated home 
location back again to the home location visiting nodes as given by a navigation 
constraint. The input graph is meant to represent the internet, with URLs as 
nodes and links as the edge relation. Each automaton is thought of as a search 
strategy or a search engine defined by a user whose home location on the net 
defines the start node for the automaton. Every run of the automaton navigates 
the net in some way. The user controls this navigation by specifying requirements 
on the quality of information in accessed sites. Therefore, the (static) description 
of the net comes with a specification of information available at each node, and 
each user automaton includes a quality judgement of sites, based on information
they contain. We use letters of a finite information alphabet to abstractly 
associate keywords with each node, and a set of propositional letters to denote 
quality judgement. 

Which locations in the net are accessible to any specific user? In principle,
all. However, if the user can access arbitrary locations only by using a search 
engine, then the vocabulary used to query the engine effectively limits the range 
of locations accessed. This is reflected in the model below. 



2.1 Navigators 

A web is a tuple $W = (U, L, I, \iota)$, where $U$ is a finite set of locations (URLs),
$L \subseteq (U \times U)$ is the link relation, $I$ is a finite information alphabet, and
$\iota : U \to 2^I$ is the content map.

We use $u, v, \ldots$ to refer to elements of $U$ and $a, b, \ldots$ to refer to elements of
$I$. Below, we will define a logical language in which we can talk about specific
locations, connectivity between locations and navigation paths on the net. Let
$P = \{p_0, p_1, \ldots\}$ be a countable set of propositional letters. The set of all
formulas of the logic is denoted $\Phi$, and we use $\phi, \phi'$ etc. to refer to formulas
in $\Phi$. $P_\phi$ denotes the set of propositions occurring in $\phi$. For the rest of this
section, fix a web $W = (U, L, I, \iota)$.

A navigator on $W$ is a tuple $N_W = (h, Voc, \chi, \phi)$, where $h \in U$; $Voc$, the
vocabulary of $N_W$, is a nonempty subset of $I$ such that $\iota(h) \subseteq Voc$;
$\chi : 2^{Voc} \to 2^{P_\phi}$ is a quality map and $\phi \in \Phi$.
$U(N_W) \stackrel{def}{=} \{u \in U \mid (\iota(u) \cap Voc) \neq \emptyset\}$ is said
to be the set of locations visible to $N_W$. $Val(N_W) : U(N_W) \to 2^{P_\phi}$ is the
induced valuation defined by: $Val(N_W)(u) = \chi(\iota(u) \cap Voc)$.

$h$ is the home location of the navigator $N_W$. Note that $h$ is always visible to
$N_W$. $\chi$ is a map that indirectly assigns judgements to locations in $U$, which is
given by the induced valuation map.

Let $N_W = (h, Voc, \chi, \phi)$ be a navigator on $W$. The subnet visible to $N_W$
is given by: $W' = (U', L', Voc, \iota')$, where $U' = U(N_W)$, $L' = L \cap (U' \times U')$ and
$\iota' : U' \to 2^{Voc}$ is defined by: $\iota'(u) = \iota(u) \cap Voc$. $W'$ is denoted as $W \lceil N_W$.
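These definitions translate almost directly into code. The sketch below is only an illustration: the type and function names (Web, visible_subnet, induced_valuation, QualityMap) are hypothetical, and locations, keywords and propositions are all represented as strings.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <utility>

typedef std::string Location;
typedef std::set<std::string> Letters;   // a subset of I or of P

// Hypothetical representation of a web W = (U, L, I, iota).
struct Web {
    std::set<Location> U;                          // locations
    std::set<std::pair<Location, Location> > L;    // link relation
    std::map<Location, Letters> iota;              // content map
};

// Computes iota(u) intersected with Voc.
inline Letters restricted_content(const Web& w, const Location& u,
                                  const Letters& voc) {
    Letters out;
    auto it = w.iota.find(u);
    if (it == w.iota.end()) return out;
    for (const std::string& a : it->second)
        if (voc.count(a)) out.insert(a);
    return out;
}

// Computes the subnet W' = (U', L', Voc, iota') visible to a navigator
// with vocabulary voc.
inline Web visible_subnet(const Web& w, const Letters& voc) {
    Web s;
    for (const Location& u : w.U) {
        Letters c = restricted_content(w, u, voc);
        if (!c.empty()) {            // u is visible iff the intersection is nonempty
            s.U.insert(u);
            s.iota[u] = c;
        }
    }
    for (const auto& e : w.L)        // keep links between visible locations
        if (s.U.count(e.first) && s.U.count(e.second)) s.L.insert(e);
    return s;
}

// The quality map chi, from sets of keywords to sets of propositions.
typedef std::function<Letters(const Letters&)> QualityMap;

// Computes the induced valuation Val(N_W)(u) = chi(iota(u) ∩ Voc).
inline std::map<Location, Letters> induced_valuation(const Web& w,
                                                     const Letters& voc,
                                                     const QualityMap& chi) {
    std::map<Location, Letters> val;
    Web s = visible_subnet(w, voc);
    for (const Location& u : s.U) val[u] = chi(s.iota[u]);
    return val;
}
```

A location whose content shares no keyword with the vocabulary simply disappears from the subnet, together with every link touching it.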

A trajectory of $N_W$ is a finite sequence $\rho \in (U(N_W))^*$ of the form
$u_0 u_1 \cdots u_K$, $K > 0$, where $u_0 = u_K = h$. Thus a trajectory is an itinerary
of locations visited on the net. We do not explicitly represent bookmarking
visited locations; as we will see below, the navigation constraint $\phi$ can refer to
sites that have been visited already, and this provides an implicit way of referring
to bookmarks in trajectories.

Let $Tr(N_W)$ denote the set of all trajectories of $N_W$. The behaviour of $N_W$
is given by the set $Beh(N_W) \stackrel{def}{=} \{\rho \in Tr(N_W) \mid \rho \models \phi\}$, where the
relation $\rho \models \phi$ is defined below. Thus, we need to give the syntax and
semantics of the logic to complete the definition of the model.

The formulas of the logic are presented in two layers. The lower layer consists 
of location formulas, and their syntax is given as follows: 

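$\Gamma ::= p \in P \mid \neg\alpha \mid \alpha_1 \vee \alpha_2 \mid \langle from \rangle\alpha \mid \langle to \rangle\alpha$

The other logical connectives $\wedge$, $\supset$, $\equiv$ are defined as usual. The dual
modalities are given by:

$[from]\alpha \stackrel{def}{=} \neg\langle from \rangle\neg\alpha$ and $[to]\alpha \stackrel{def}{=} \neg\langle to \rangle\neg\alpha$.

The syntax of navigation formulas is given as follows, where $\alpha \in \Gamma$:

$\Phi ::= \alpha \mid \neg\phi \mid \phi_1 \vee \phi_2 \mid \surd\alpha \mid \triangleright\phi \mid \alpha?\phi \mid \Diamond\phi$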

Location formulas specify how locations with specific quality of information
are interlinked in the net, with $\langle from \rangle\alpha$ asserting the existence of a link from a
node satisfying $\alpha$ to the current node, and $\langle to \rangle\alpha$ asserting the other way about.
$\surd\alpha$ refers to an implicit bookmark to a site where $\alpha$ holds; it says that such a site
has been visited in the trajectory at this stage, and hence that such information
is available to the user. $\triangleright\phi$ asserts that the navigation given by $\phi$ continues
along one of the links available at the current location (by a 'click'), whereas
$\alpha?\phi$ directs a search for a location satisfying $\alpha$, from where the navigation $\phi$
continues. $\Diamond$ represents eventuality, and its dual is given by
$\Box\phi \stackrel{def}{=} \neg\Diamond\neg\phi$.

Let $N_W = (h, Voc, \chi, \phi)$ be a navigator, $W' = (U', L', Voc, \iota')$ the subnet of
$W$ visible to $N_W$ and $Val(N_W)$ the valuation induced by $N_W$. Location formulas
are interpreted over locations of $W'$. $W', u \models_l \alpha$ denotes that $\alpha$ holds in location
$u$ in net $W'$.

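- $W', u \models_l p$ iff $p \in Val(N_W)(u)$.

- $W', u \models_l \neg\alpha$ iff $W', u \not\models_l \alpha$.

- $W', u \models_l \alpha \vee \beta$ iff $W', u \models_l \alpha$ or $W', u \models_l \beta$.

- $W', u \models_l \langle from \rangle\alpha$ iff there exists $u'$ such that $(u', u) \in L'$ and $W', u' \models_l \alpha$.

- $W', u \models_l \langle to \rangle\alpha$ iff there exists $u'$ such that $(u, u') \in L'$ and $W', u' \models_l \alpha$.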
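The inductive definition can be read as a recursive evaluator. The following sketch is an illustration only: the formula representation (LocFormula and its builder functions) and the Net structure holding $L'$ and $Val(N_W)$ are hypothetical choices made for this example.

```cpp
#include <map>
#include <memory>
#include <set>
#include <string>
#include <utility>

// Hypothetical syntax tree for location formulas.
struct LocFormula;
typedef std::shared_ptr<const LocFormula> LocPtr;
struct LocFormula {
    enum Kind { Prop, Not, Or, From, To } kind;
    std::string prop;     // used by Prop only
    LocPtr left, right;   // operands
};

inline LocPtr node(LocFormula::Kind k, const std::string& p, LocPtr l, LocPtr r) {
    return std::make_shared<LocFormula>(LocFormula{k, p, l, r});
}
inline LocPtr prop(const std::string& p) { return node(LocFormula::Prop, p, nullptr, nullptr); }
inline LocPtr neg(LocPtr a) { return node(LocFormula::Not, "", a, nullptr); }
inline LocPtr disj(LocPtr a, LocPtr b) { return node(LocFormula::Or, "", a, b); }
inline LocPtr from(LocPtr a) { return node(LocFormula::From, "", a, nullptr); }
inline LocPtr to(LocPtr a) { return node(LocFormula::To, "", a, nullptr); }

// Hypothetical view of the visible subnet: links L' and valuation Val(N_W).
struct Net {
    std::set<std::pair<std::string, std::string> > links;
    std::map<std::string, std::set<std::string> > val;
};

// Returns true iff W', u |=_l f, following the five clauses above.
inline bool holds(const Net& n, const std::string& u, const LocPtr& f) {
    switch (f->kind) {
    case LocFormula::Prop: {
        auto it = n.val.find(u);
        return it != n.val.end() && it->second.count(f->prop) != 0;
    }
    case LocFormula::Not:
        return !holds(n, u, f->left);
    case LocFormula::Or:
        return holds(n, u, f->left) || holds(n, u, f->right);
    case LocFormula::From:   // some u' with (u', u) in L' satisfies the operand
        for (const auto& e : n.links)
            if (e.second == u && holds(n, e.first, f->left)) return true;
        return false;
    case LocFormula::To:     // some u' with (u, u') in L' satisfies the operand
        for (const auto& e : n.links)
            if (e.first == u && holds(n, e.second, f->left)) return true;
        return false;
    }
    return false;
}
```

Since the two modalities look only one link away, evaluating a formula at a location costs at most one pass over the links per modality nesting level.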

Let $\rho = u_0 u_1 \cdots u_K \in Tr(N_W)$. For $\phi \in \Phi$ and $k \in \{0, \ldots, K\}$, the notion
$\rho, k \models \phi$ is again defined inductively.

- $\rho, k \models \alpha$, for $\alpha \in \Gamma$, iff $W', u_k \models_l \alpha$.

- $\rho, k \models \neg\phi$ iff $\rho, k \not\models \phi$.

- $\rho, k \models \phi_1 \vee \phi_2$ iff $\rho, k \models \phi_1$ or $\rho, k \models \phi_2$.

- $\rho, k \models \surd\alpha$ iff there exists $m : 0 \leq m \leq k$ such that $W', u_m \models_l \alpha$.

- $\rho, k \models \triangleright\phi$ iff $k < K$, $(u_k, u_{k+1}) \in L'$ and $\rho, k+1 \models \phi$.

- $\rho, k \models \alpha?\phi$ iff $k < K$, $W', u_{k+1} \models_l \alpha$ and $\rho, k+1 \models \phi$.

- $\rho, k \models \Diamond\phi$ iff there exists $m : k \leq m \leq K$ such that $\rho, m \models \phi$.

We use the notation $\rho \models \phi$ when $\rho, 0 \models \phi$. This completes the definition
of navigator $N_W$. Note that formulas do not directly refer to the information
contained in locations, but only to the quality of such information, as given by
$\chi$. $\alpha?\phi$ is more natural here, but we could have formulas $w?\phi$, where $w$ is a
keyword in $Voc$, without affecting our results.
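The satisfaction relation on trajectories can likewise be sketched as a recursive function. All names below are hypothetical. To keep the example short, a location formula is abstracted as a predicate on locations, and the bounds in the bookmark and eventuality clauses are taken to be inclusive, which is an assumption of this sketch. The trajectory is assumed to be nonempty.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <set>
#include <string>
#include <utility>
#include <vector>

typedef std::string Location;
typedef std::set<std::pair<Location, Location> > Links;   // L'
// A location formula, abstracted as the predicate "alpha holds at u".
typedef std::function<bool(const Location&)> LocPred;

// Hypothetical syntax tree for navigation formulas.
struct Nav;
typedef std::shared_ptr<const Nav> NavPtr;
struct Nav {
    enum Kind { Atom, Not, Or, Visited, Click, Search, Eventually } kind;
    LocPred alpha;        // used by Atom, Visited and Search
    NavPtr left, right;   // sub-formulas
};

inline NavPtr mk(Nav::Kind k, LocPred a, NavPtr l = nullptr, NavPtr r = nullptr) {
    return std::make_shared<Nav>(Nav{k, a, l, r});
}

// Returns true iff rho, k |= f for the trajectory rho = u_0 ... u_K.
inline bool sat(const Links& links, const std::vector<Location>& rho,
                std::size_t k, const NavPtr& f) {
    const std::size_t K = rho.size() - 1;
    switch (f->kind) {
    case Nav::Atom:
        return f->alpha(rho[k]);
    case Nav::Not:
        return !sat(links, rho, k, f->left);
    case Nav::Or:
        return sat(links, rho, k, f->left) || sat(links, rho, k, f->right);
    case Nav::Visited:      // implicit bookmark: alpha held at some u_m, m <= k
        for (std::size_t m = 0; m <= k; ++m)
            if (f->alpha(rho[m])) return true;
        return false;
    case Nav::Click:        // the next step follows a link
        return k < K &&
               links.count(std::make_pair(rho[k], rho[k + 1])) != 0 &&
               sat(links, rho, k + 1, f->left);
    case Nav::Search:       // the next step jumps to a location satisfying alpha
        return k < K && f->alpha(rho[k + 1]) &&
               sat(links, rho, k + 1, f->left);
    case Nav::Eventually:
        for (std::size_t m = k; m <= K; ++m)
            if (sat(links, rho, m, f->left)) return true;
        return false;
    }
    return false;
}
```

The difference between a click and a search shows up directly: the click clause consults the link relation, whereas the search clause only tests the target location.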

3 Analysis 

Given a web $W$ and a navigator $N_W$ on it, we can analyse $W$ and $N_W$ to
answer a number of queries regarding the 'search engine' given by $N_W$. The
following queries seem natural: Does the navigation meet some basic quality
requirement? Is a site with some specific information visited at all? Is it the
case that every visited site of some quality has links only to sites having the
same quality of information? Is the navigation constraint consistent? That is,
is $Beh(N_W) \neq \emptyset$? Many of these and other related questions can be answered
by a simple technique: construct a verifying automaton (an ordinary
NFA) which captures the behaviour of the given navigator, and run standard
algorithms on that automaton. Below we describe how we can associate such an
automaton with every navigator.

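Define as usual the subformula closure of any formula $\phi \in \Phi$, denoted $CL(\phi)$.
The size of $CL(\phi)$ is linear in the length of $\phi$. Let $\triangleright_N$ denote the set of all
formulas in $CL(\phi)$ of the form $\triangleright\psi$. Similarly let $?_N$ contain all $\alpha?\psi$ formulas in
$CL(\phi)$. Let $Nxt = \triangleright_N \cup\, ?_N$.

Let $A \subseteq CL(\phi)$. $A$ is said to be an atom iff it satisfies the following conditions:

- For all $\neg\psi \in CL(\phi)$, $\neg\psi \in A$ iff $\psi \notin A$.

- For all $\psi_1 \vee \psi_2 \in CL(\phi)$, $\psi_1 \vee \psi_2 \in A$ iff $\psi_1 \in A$ or $\psi_2 \in A$.

- $A \cap ?_N = \emptyset$ or $A \cap \triangleright_N = \emptyset$.

- If $A \cap Nxt = \emptyset$ then for all $\Diamond\psi \in A$, $\psi \in A$.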

Assume that we are given a web W = and navigator Nw = 

{h, Voc, X, Po) on it. % would be presented as a list of boolean formulas on Voc, 
one for each p G P<f>o- 

Let |(()ol = ■'TT-o- Let CL denote the subformula closure of pg. Fix an ordering 
^ of CL such that if \pi\ < \p 2 \ then pi ^ p 2 - Let CL{i) denote the formula 
in the enumeration, and let all formulas from L precede those from — L . 

Step 1: Construct W\Nw = (G', L', Moc, P). This construction, which takes 
0{\U\) time in the worst case, depends crucially on how l is represented. Let 
M = \Lf'\. Fix an enumeration of locations in U' . 

Step 2: Construct a \{CL fl T)| x M boolean array p such that p{i,j) = 1 
iff W', u \=i a, where CL{i) = a and u is the location in the enumeration of
W . (We will use the notation u) = 1 in this case.) This construction can be
done in 0{M.mo) time. However, we can identify locations as follows: ui ~ U 2 iff 
for all a G {CLdF), fi{a, ui) = fi{a, ^ 2 ). Thus we only need a boolean mo x 
array. Let [u]) represent these elements. 

Step 3: Let AT denote the set of all subsets of CL which are atoms. 
Let Vis {a G r I ya G CL{(j))}. The states of our verifying automaton 

def 

An = {Q,^, I j P) keep track of these formulas in addition to the URLs vis- 
ited: 

— Q = {([u],A, S') \ u G L',A G AT, S C Vis and for all a G T: a G A iff 

fi{a, [u]) = 1 and if \/a G A then a G S}. 

- 7 = {{[h],A,Ar\Vis) I A G AT,^o € A}. 

— F = {{[h],A, S) I A G AT such that A n Nxt = 0}. 

- (M,A,S)^(H,73,S') iff 

1. An Nxt yf 0, 

2. litxj) G A then cj) G B and ([u], [u]) G [L'\, 

3. if al4> G A then /r(a, [u]) = 1 and 4> G B, 

4. if 04> G A and 4> ^ A then G B, and 

5. S' = SU{BnVis). 

For $q = ([u], A, S) \in Q$, we use the notation $q|U$ to denote $u$. Let $\rho =
u_0 u_1 \ldots u_k \in U^*$. We say that $\rho$ is accepted by $A_N$ iff there exists a sequence
$q_0 q_1 \ldots q_k$ such that $q_0 \in I$, $q_k \in F$, and for all $i : 0 \le i < k$, $q_i \Rightarrow q_{i+1}$ and
$\rho = q_0|U \ldots q_k|U$. Let $Lang(A_N)$ denote the set of all strings in $U^*$ accepted
by $A_N$.

Theorem 1. If $A_N$ is the automaton associated with navigator $N_W$ on web $W$,
then $Beh(N_W) = Lang(A_N)$.

The construction takes 0{\U\) + 0{M.mo) + 2'^-’"“, that is, 0{\U\) + 
time, where c is a (small) constant. We can then answer queries like the ones 
listed in the beginning of this section. Since the number of URLs in the web is 
large, the fact that the 0{\U\) factor is additive is crucial. It can then be seen as 
a preprocessing step for the algorithm. M = \U'\ is usually much smaller, and 
hence the multiplicative factor is more acceptable. 

After elimination of useless states and transitions, we can have an efficient 
representation of An that helps us do reachability analysis, enumerate connected 
components, and so on. Then, checking whether a location is visited, or whether 
there exists a trajectory visiting all locations in a given set, is simple. 
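
The reachability checks mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the verifying automaton is assumed to be available as an explicit successor map, states are assumed to be tuples whose first component is the location, and the names `reachable_states`, `is_nonempty` and `location_visited` are hypothetical helpers, not part of the construction in the text.

```python
from collections import deque


def reachable_states(sources, transitions):
    """Breadth-first search: all states reachable from `sources`."""
    seen = set(sources)
    queue = deque(sources)
    while queue:
        q = queue.popleft()
        for r in transitions.get(q, ()):
            if r not in seen:
                seen.add(r)
                queue.append(r)
    return seen


def is_nonempty(initial, final, transitions):
    """The accepted language is non-empty iff a final state is reachable."""
    return bool(reachable_states(initial, transitions) & set(final))


def location_visited(location, initial, final, transitions):
    """True iff some accepting run passes through a state at `location`."""
    forward = reachable_states(initial, transitions)
    # Reverse the transition relation to find states that can reach a final state.
    reverse = {}
    for q, successors in transitions.items():
        for r in successors:
            reverse.setdefault(r, set()).add(q)
    backward = reachable_states(final, reverse)
    useful = forward & backward          # states left after removing useless ones
    return any(q[0] == location for q in useful)


# Toy automaton (assumed data): states are (location, tag) pairs.
trans = {("home", 0): {("a", 0), ("b", 0)}, ("a", 0): {("c", 0)}}
init = {("home", 0)}
fin = {("c", 0)}
print(is_nonempty(init, fin, trans), location_visited("b", init, fin, trans))
```

Restricting attention to states that are both reachable and co-reachable corresponds to the elimination of useless states described in the text.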



4 A Web Example 



An example researcher who looks for grants dealing with work on genomes is
considered by Pazzani, Muramatsu and Billsus [PMB96]. We consider a style sheet
for such a researcher. The web $W = (U, L, I, \iota)$ is given, and a relevant portion
of the structure $(U, L, \iota)$ is shown in Figure 1. The style sheet is given by $N_W =
(Home, Voc, \chi, \varphi)$, where $Voc$ = {grant, genome, institute, foundation, project,
experiment, cloning, publist, appform}, and $I = Voc \cup$ {chromosome-22,
sex, ...}.







Fig. 1. Grants and genomes on the web 



The set of propositions consists of {commercial, government, authority, hub,
spam, submit, interesting, highstatus, money, ...}. For our example, we assume
the only commercial site is Gene Inc and the only government site is
DBT. Authority and hub refer to a URL having high in-degree and out-degree
respectively. These are indicated in the figure by a few arrows coming into and
going out of a node, respectively. Spam, submit and interesting are true in URLs
which have sex, appform and {grant, genome} respectively.

The two propositions listed last decide parameters of interest to our re- 
searcher: whether the URL represents a place which commands status and 
whether it has money. Sites satisfying highstatus are shown shaded in the fig- 
ure: the proposition is true in those sites which represent a foundation, or which 
do research in cloning, or which have a large list of publications (that is, a link 
to a site with keyword publist which satisfies hub). Money is true in the two 
government and commercial sites, as well as in all the sites having the keyword 
foundation. 

This completes the (indirect) definition of the quality function. We now come
to the navigation part. The navigation constraint $\varphi$ is defined as the conjunction
of formulas $\varphi_i$, $i \in \{1, \ldots, 5\}$, given below.

$\varphi_1 = \Box\neg spam$ eliminates junk sites.

(j )2 = 0(^->highstatus d highstatus) ensures that a highstatus site is visited 
first. 

(j)^ = ^[authority 3 -•authority) similarly prioritizes sites which are less well 
known (and presumably have fewer contenders for grants). 

(f >4 = a (^(to) submit D ^submit) visits a link containing an application form. 

05 = interesting Amoney?0{interesting Amoney?0{interesting Amoney)) vis- 
its three sites, according to the user’s rating and priorities, which are interesting 
and have money. 

Having constructed the automaton, the user can now ask questions of in- 
terest: Is Gene Inc on a particular trajectory? Is there a trajectory visiting all 
interesting sites with money? and so on. 

5 Discussion 

The framework presented here is intended only as preliminary. As mentioned in 
Section 1, the design of the logical language should be guided more by application 
requirements than theoretical considerations. However, some simple extensions 
can be made without drastically altering the current framework. 

— p,k \= 0iB02 iff if there exists m> k such that p,m \= 02 , then there exists 
I : k < I <m such that p,l ^ 0i. (This allows priority in visiting locations.) 

— p, fc 1= <0 iff 0 < fc < AT, Uk-i = Uk+i and p,k + 1 \= 4>. (Back button.) 

— P,k \= >0 iff there exists m > 0 such that k + m < K, {u | {uk, u) £ L} = 
{uk+i , . • . , Uk+m} and and p,k + m \= 4>. (Visit all local edges.) 

— p,k \= Ea, iff {u I IV', u \=i a} C {ug, ui,. . . Uk}- (Exhaustive search.) 

— p,k \= a??0, iff fc < AT, IV', Uk+i \=i a, Uk+i ^ {ug, . . . , Uk} and p,k+l \= 4>. 
(New location.) 

The first two specifications can be easily incorporated into the automaton
construction without increasing the complexity. The third case is rather tricky:
the surprise is that it can be carried out without increasing the complexity. For
the last two modalities, the automaton construction is easy to modify, but the
cost increases dramatically. In the states of $A_N$, we need to carry the set of
$u$-equivalence classes visited, so that the check for exhaustive visit or new visit can
be made. Because of this, the construction becomes doubly exponential in the
size of the navigation constraint in $N_W$. However, if we were interested only in
the construction of an automaton $A_N$ with the property that $Beh(N_W) \ne \emptyset$ iff $Lang(A_N) \ne \emptyset$,
this could be done in singly exponential time.

There are other interesting and desirable extensions which are not minor and 
require reworking the theory. For instance, there is a good reason to want quan- 
tification over locations, or over information objects. Extending the framework 
to include page structure within URLs is easy though rather messy; a different 
formulation may be more convenient. 

Another direction is explicit representation of actions in the framework, like
bookmarking locations, downloading specific information objects, etc. An even
more challenging step is to study dynamic nets, where links are made and broken
during the course of navigation. This would lead to the consideration of
several navigators cruising the web simultaneously, perhaps exchanging
information among themselves.

Such a picture raises the obvious question of information security, which
is not modelled here at all. The style sheets envisioned here will be realistic only
if security considerations become part of navigation specifications.

Abstract. This paper presents an algorithm for the direct building of a
minimal acyclic subsequential transducer, which represents a finite relation
given as a sorted list of words with their outputs. The algorithm
constructs the minimal transducer directly, without constructing
intermediate tree-like or pseudo-minimal transducers. In NLP applications
our algorithm provides significantly better efficiency than the other
algorithms building minimal transducers for large-scale natural language
dictionaries. Some experimental comparisons are presented at the end of
the paper.



1 Introduction 

For the application of large-scale dictionaries two major problems have to be 
solved: fast lookup speed and compact representation. Using automata we can 
achieve fast lookup by determinization and compact representation by minimiza- 
tion. For providing information for the recognized words we have to construct 
automata with outputs or transducers. The use of automata with labels on the 
final states for representation of dictionaries is presented by Dominique Revuz 
in jSj. In |bl/j Mehryar Mohri reviews the application of transducers for Natural 
Language Processing. He compares the benefits using subsequential transduc- 
ers. The transducers are more compact in some cases and can be combined by 
composition or other relational operators. The transducers can be applied also 
for the reverse direction - to find the input words which are mapped to a given 
output. 

In this paper we focus on building the minimal subsequential transducer for 
a given input list of words with their outputs. This is the procedure required 
for the initial construction of the transducer representing a dictionary. Earlier
methods first build temporary transducers for the input list,
which later have to be minimized. These temporary transducers can be huge
compared to the resulting minimized one. For example in 0 Mehryar Mohri
writes:

© Springer-Verlag Berlin Heidelberg 2001






S. Mihov and D. Maurel 



“But, as with automata, one cannot construct directly the p- 
subsequential transducer representing a large-scale dictionary. The tree 
construction mentioned above leads indeed to a blow up for a large num- 
ber of entries. So, here again, one needs first to split the dictionary into 
several parts, construct the corresponding p-subsequential transducers, 
minimize them, and then perform the union of these transducers and 
reminimize the resulting one.” 

In 0 Denis Maurel efficiently builds the pseudo-minimal subsequential
transducer, which can be significantly smaller than the tree-like transducer. The
pseudo-minimal transducer still has to be minimized. Our experiments
show that for large-scale dictionaries the pseudo-minimal subsequential
transducer is about 10 times larger than the minimized transducer.

In this paper we present an algorithm for building the minimal subsequential
transducer for a given sorted list without building any intermediate
non-minimal transducers. The algorithm is a combination of the algorithm
for direct construction of a minimal acyclic finite-state automaton given in m
with the methods for construction of minimal subsequential transducers given
in m. The resulting subsequential transducer is minimal.

In comparison with the approach of Mehryar Mohri, we do not build minimal
intermediate transducers for parts of the dictionary, which would have to be
minimized again after deterministic union. We proceed incrementally, word by
word, building at each step a transducer that is minimal except for the last word.



2 Mathematical Concepts 

In this section we briefly present the mathematical basics used in the algorithm
for direct construction of minimal subsequential transducers. A more detailed
presentation, with the corresponding proofs for automata that are minimal
except for a word, is given in j^.



2.1 Subsequential Transducers 

Definition 1. A p-subsequential transducer is a tuple $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$, where:

— $\Sigma$ is a finite input alphabet;

— $\Delta$ is a finite output alphabet;

— $S$ is a finite set of states;

— $s \in S$ is the starting state;

— $F \subseteq S$ is the set of final states;

— $\mu : S \times \Sigma \to S$ is a partial function called the transition function;

— $\lambda : S \times \Sigma \to \Delta^*$ is a partial function called the output function;

— $\Psi : F \to 2^{\Delta^*}$ is the final function. We will require that $\forall r \in F\ (|\Psi(r)| \le p)$.

The function $\mu$ is extended naturally over $S \times \Sigma^*$ as in the case for finite state
automata:

$\forall r \in S\ (\mu^*(r, \varepsilon) = r)$; $\forall r \in S\ \forall \sigma \in \Sigma^*\ \forall a \in \Sigma\ (\mu^*(r, \sigma a) = \mu(\mu^*(r, \sigma), a))$.

The function $\lambda$ is extended over $S \times \Sigma^*$ by the following definition:

$\forall r \in S\ (\lambda^*(r, \varepsilon) = \varepsilon)$; $\forall r \in S\ \forall \sigma \in \Sigma^*\ \forall a \in \Sigma\ (\lambda^*(r, \sigma a) = \lambda^*(r, \sigma)\,\lambda(\mu^*(r, \sigma), a))$.

The set $L(\mathcal{T}) = \{\sigma \in \Sigma^* \mid \mu^*(s, \sigma) \in F\}$ is called the input language of the
transducer $\mathcal{T}$. The subsequential transducer maps each word from the input
language to a set of at most $p$ output words. The output function $O_{\mathcal{T}} : L(\mathcal{T}) \to 2^{\Delta^*}$
of the transducer is defined as follows:

$\forall \sigma \in L(\mathcal{T})\ (O_{\mathcal{T}}(\sigma) = \lambda^*(s, \sigma) \cdot \Psi(\mu^*(s, \sigma)))$.

Two transducers $\mathcal{T}$ and $\mathcal{T}'$ are called equivalent when $L(\mathcal{T}) = L(\mathcal{T}')$ and
$O_{\mathcal{T}} = O_{\mathcal{T}'}$.
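
To make the definition concrete, here is a minimal sketch of how the output function of a p-subsequential transducer can be evaluated. The dictionary-based encoding (`mu`, `lam`, `psi` keyed by state and letter) and the function name `transducer_output` are assumptions made for illustration; they are not prescribed by the paper.

```python
def transducer_output(mu, lam, psi, start, word):
    """Evaluate the output set of `word`, or None if the word is rejected.

    mu  : dict (state, letter) -> state    partial transition function
    lam : dict (state, letter) -> string   partial output function
    psi : dict state -> set of strings     final function, defined on final states
    """
    state, prefix = start, ""
    for a in word:
        if (state, a) not in mu:           # transition undefined: word rejected
            return None
        prefix += lam[(state, a)]          # accumulate the extended output
        state = mu[(state, a)]             # follow the extended transition
    if state not in psi:                   # reached state is not final
        return None
    return {prefix + tail for tail in psi[state]}


# Toy 2-subsequential transducer (assumed data) mapping "feb" to {28, 29}.
mu = {(0, "f"): 1, (1, "e"): 2, (2, "b"): 3}
lam = {(0, "f"): "2", (1, "e"): "", (2, "b"): ""}
psi = {3: {"8", "9"}}
print(transducer_output(mu, lam, psi, 0, "feb"))
```

The common prefix "2" is emitted on the first transition and the two distinct tails are kept in the final function, which is why this word needs $p \ge 2$.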

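Definition 2. Let $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$ be a subsequential transducer.

1. The state $r \in S$ is called reachable from $t \in S$, when $\exists \sigma \in \Sigma^*\ (\mu^*(t, \sigma) = r)$.

2. We define the subtransducer starting in $s' \in S$ as:

$\mathcal{T}|_{s'} = (\Sigma, \Delta, S', s', F \cap S', \mu|_{S' \times \Sigma}, \lambda|_{S' \times \Sigma}, \Psi|_{F \cap S'})$, where:

$S' = \{r \in S \mid r \text{ is reachable from } s'\}$.

3. Two states $s_1, s_2 \in S$ are called equivalent, when $\mathcal{T}|_{s_1}$ and $\mathcal{T}|_{s_2}$ are equivalent
(when $L(\mathcal{T}|_{s_1}) = L(\mathcal{T}|_{s_2})$ and $O_{\mathcal{T}|_{s_1}} = O_{\mathcal{T}|_{s_2}}$).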

We cannot directly use the minimization algorithms developed for automata,
because in some cases moving the output labels (or parts of them) along the
paths yields a smaller transducer. To avoid this we have to use transducers
which have the property that the output is pushed back toward the initial state
as far as possible. Mehryar Mohri jSj shows that there is a minimal transducer
which satisfies this property. We will define this more formally below.

With $u \wedge v$ we denote the longest common prefix of the words $u$ and $v$ from
$\Sigma^*$ and with $u^{-1}(uv)$ we denote the word $v$, the quotient of the left division of
$uv$ by $u$. For the set of words $A = \{a_1, a_2, \ldots, a_n\}$, with $\bigwedge A$ we will denote the
word $\bigwedge A = a_1 \wedge a_2 \wedge \ldots \wedge a_n$.

With $D(\mathcal{T})$ we will denote the set of the prefixes of $L(\mathcal{T})$:

$D(\mathcal{T}) = \{u \in \Sigma^* \mid \exists w \in \Sigma^*\ (uw \in L(\mathcal{T}))\}$.

We define the function $g_{\mathcal{T}} : D(\mathcal{T}) \to \Delta^*$ as follows:

$g_{\mathcal{T}}(\varepsilon) = \varepsilon$; $g_{\mathcal{T}}(u) = \bigwedge_{w \in \Sigma^* \,\&\, uw \in L(\mathcal{T})} \bigwedge O_{\mathcal{T}}(uw)$, for $u \in D(\mathcal{T})$, $u \ne \varepsilon$.
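
The three operations just introduced (longest common prefix, left quotient, and the function $g$) can be sketched directly on a finite relation. This is an illustrative sketch only: the relation is assumed to be given as a dictionary from words to output sets, and the names `lcp`, `left_quotient` and `g` are hypothetical.

```python
from functools import reduce


def lcp(u, v):
    """Longest common prefix of two words."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return u[:n]


def left_quotient(u, w):
    """Return v such that w = u v; u is assumed to be a prefix of w."""
    assert w.startswith(u)
    return w[len(u):]


def g(relation, u):
    """Longest common prefix of all outputs of words extending u.

    relation: dict word -> set of output strings (the finite relation).
    u is assumed to be a prefix of at least one word of the relation.
    """
    if u == "":
        return ""
    outputs = [o for w, outs in relation.items() if w.startswith(u) for o in outs]
    return reduce(lcp, outputs)


rel = {"apr": {"30"}, "aug": {"31"}, "feb": {"28", "29"}}
# Output that the canonical form places on the transition reading 'p' after 'a':
print(left_quotient(g(rel, "a"), g(rel, "ap")))
```

Under these assumptions, the printed value is the part of the output of "apr" that cannot be emitted earlier because "aug" shares only the first digit.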

Now we are ready to define the canonical subsequential transducer.

Definition 3. The subsequential transducer $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$ is called
canonical if the following condition holds:

$\forall r \in S\ \forall a \in \Sigma\ \forall \sigma \in \Sigma^*\ ((\mu^*(s, \sigma) = r \ \&\ !\mu(r, a)) \to \lambda(r, a) = [g_{\mathcal{T}}(\sigma)]^{-1} g_{\mathcal{T}}(\sigma a))$

Here $!\mu(r, a)$ means that $\mu(r, a)$ is defined. We can see that the condition above
corresponds to the property that the output is pushed back to the initial state as
much as possible.

The subsequential transducer $\mathcal{T}$ is called minimal if any other transducer
equivalent to $\mathcal{T}$ has at least as many states as $\mathcal{T}$.

Theorem 1. For every subsequential transducer $\mathcal{T}$ there exists a minimal
canonical subsequential transducer equivalent to $\mathcal{T}$.

Theorem 2. If there are no different equivalent states in the canonical
subsequential transducer $\mathcal{T}$ then $\mathcal{T}$ is minimal.

A more complete presentation of the minimal subsequential transducers can
be found in ^).



2.2 Minimal Except for a Word Subsequential Transducer 

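Definition 4. Let $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$ be a subsequential transducer with
input language $L(\mathcal{T})$. Then the transducer $\mathcal{T}$ is called minimal except for the
word $\omega \in \Sigma^*$, when the following conditions hold:

1. Every state is reachable from the starting state and from every state a final
state is reachable;

2. $\omega$ is a prefix of the last word in the lexicographical order of $L(\mathcal{T})$;

In that case we can introduce the following notations:

$\omega = w_1^{\mathcal{T}} w_2^{\mathcal{T}} \ldots w_k^{\mathcal{T}}$, where $w_i^{\mathcal{T}} \in \Sigma$, for $i = 1, 2, \ldots, k$ (1)

$t_0^{\mathcal{T}} = s$; $t_1^{\mathcal{T}} = \mu(t_0^{\mathcal{T}}, w_1^{\mathcal{T}})$; $t_2^{\mathcal{T}} = \mu(t_1^{\mathcal{T}}, w_2^{\mathcal{T}})$; ...; $t_k^{\mathcal{T}} = \mu(t_{k-1}^{\mathcal{T}}, w_k^{\mathcal{T}})$ (2)

$T = \{t_0^{\mathcal{T}}, t_1^{\mathcal{T}}, \ldots, t_k^{\mathcal{T}}\}$ (3)

3. In the set $S \setminus T$ there are no different equivalent states;

4. $\forall r \in S\ \forall i \in \{0, 1, \ldots, k\}\ \forall a \in \Sigma\ (\mu(r, a) = t_i \leftrightarrow (i > 0 \ \&\ r = t_{i-1} \ \&\ a = w_i))$;

5. $\mathcal{T}$ is a canonical subsequential transducer.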



Example 1. An acyclic 2-subsequential transducer over the input alphabet
{a,b,c} is given in Figure 1. The input language of the transducer is
{apr, aug, dec, feb, jan, jul}. The output function of the transducer is:
O(apr) = {30}; O(aug) = {31}; O(dec) = {31}; O(feb) = {28, 29}; O(jan) =
{31}; O(jul) = {31}. This transducer is minimal except for the word jul.

Fig. 1. Subsequential transducer minimal except for jul.



Proposition 1. A subsequential transducer which is minimal except for the
empty word $\varepsilon$ is minimal.

Lemma 1. Let the subsequential transducer $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$ be
minimal except for $\omega = w_1 w_2 \ldots w_k$, $\omega \ne \varepsilon$. Let there be no state equivalent to $t_k$ in
the set $S \setminus T$. Then $\mathcal{T}$ is also minimal except for the word $\omega' = w_1 w_2 \ldots w_{k-1}$.

Lemma 2. Let the subsequential transducer $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$ be
minimal except for $\omega = w_1 w_2 \ldots w_k$, $\omega \ne \varepsilon$. Let the state $p \in S \setminus T$ be equivalent to
the state $t_k$. Then the transducer:

$\mathcal{T}' = (\Sigma, \Delta, S \setminus \{t_k\}, s, F \setminus \{t_k\}, \mu', \lambda|_{S \setminus \{t_k\} \times \Sigma}, \Psi|_{S \setminus \{t_k\}})$ where:

$$\mu'(r, a) = \begin{cases} \mu(r, a), & \text{in case } (r \ne t_{k-1} \lor a \ne w_k) \text{ and } \mu(r, a) \text{ is defined} \\ p, & \text{in case } r = t_{k-1},\ a = w_k \\ \text{not defined} & \text{otherwise} \end{cases}$$

is equivalent to the transducer $\mathcal{T}$ and is minimal except for the word $\omega' =
w_1 w_2 \ldots w_{k-1}$.

Lemma 3. Let the subsequential transducer $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$ be
minimal except for $\omega = w_1 w_2 \ldots w_k$. Then for $t_k$ the following statement holds:

$t_k$ is equivalent to $r \in S \setminus T$ $\leftrightarrow$

$((t_k \in F \leftrightarrow r \in F) \ \&\ (t_k \in F \to \Psi(t_k) = \Psi(r)) \ \&$
$\forall a \in \Sigma\ ((\neg !\mu(t_k, a) \ \&\ \neg !\mu(r, a)) \lor (!\mu(t_k, a) \ \&\ !\mu(r, a) \ \&$
$\mu(t_k, a) = \mu(r, a) \ \&\ \lambda(t_k, a) = \lambda(r, a))))$.
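
The condition of Lemma 3 is a purely local test: it compares finality, final outputs, and the outgoing transitions with their targets and outputs. A minimal sketch, assuming a dictionary encoding of the transducer (the encoding and the name `locally_equivalent` are assumptions for illustration):

```python
def locally_equivalent(final, psi, mu, lam, alphabet, t, r):
    """Check the condition of Lemma 3 for states t and r.

    final : set of final states
    psi   : dict state -> set of strings (final outputs)
    mu    : dict (state, letter) -> state
    lam   : dict (state, letter) -> string
    """
    if (t in final) != (r in final):
        return False
    if t in final and psi[t] != psi[r]:
        return False
    for a in alphabet:
        t_defined = (t, a) in mu
        r_defined = (r, a) in mu
        if t_defined != r_defined:
            return False
        if t_defined and (mu[(t, a)] != mu[(r, a)] or lam[(t, a)] != lam[(r, a)]):
            return False
    return True


# Assumed toy data: states 3 and 5 are final with the same final output,
# states 2 and 4 read different letters.
final = {3, 5}
psi = {3: {""}, 5: {""}}
mu = {(2, "l"): 3, (4, "n"): 5}
lam = {(2, "l"): "", (4, "n"): ""}
print(locally_equivalent(final, psi, mu, lam, "ln", 3, 5))
```

Because the test only inspects one level of transitions, it can be used as the comparison (or hashing) key of a dictionary of already minimized states.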



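Theorem 3. Let the subsequential transducer $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$ be
minimal except for $\omega' = w_1 w_2 \ldots w_m$. Let $\psi \in L(\mathcal{T})$ be the last word in the
lexicographical order of the input language of the transducer. Let $\omega$ be a word
which is greater in lexicographical order than $\psi$. Let $\tau \in \Delta^*$ be the output
for $\omega$. Let $\omega'$ be the longest common prefix of $\psi$ and $\omega$. In that case we
can denote $\omega = w_1 w_2 \ldots w_m w_{m+1} \ldots w_k$; $k > m$. Let us use $W_n$ to denote the
word $W_n = w_1 w_2 \ldots w_n$; $n = 1, 2, \ldots, k$ and $W_0 = \varepsilon$. Let us use $\Lambda_n$ to denote
the word $\Lambda_n = \lambda^*(t_0, W_n) \wedge \tau$. Let us define the subsequential transducer
$\mathcal{T}' = (\Sigma, \Delta, S', s, F', \mu', \lambda', \Psi')$ as follows:

$t_{m+1}, t_{m+2}, \ldots, t_k$ are new states such that $S \cap \{t_{m+1}, t_{m+2}, \ldots, t_k\} = \emptyset$

$S' = S \cup \{t_{m+1}, t_{m+2}, \ldots, t_k\}$

$F' = F \cup \{t_k\}$

$$\mu'(r, a) = \begin{cases} t_{i+1}, & \text{in case } r = t_i,\ m \le i \le k-1,\ a = w_{i+1} \\ \mu(r, a), & \text{in case } r \in S \text{ and } \mu(r, a) \text{ is defined and } (r \ne t_m \lor a \ne w_{m+1}) \\ \text{not defined} & \text{otherwise} \end{cases}$$

$$\lambda'(r, a) = \begin{cases} \lambda(r, a), & \text{in case } r \in S \setminus \{t_0, t_1, \ldots, t_k\} \lor (r = t_0 \ \&\ a \ne w_1) \\ [\Lambda_{n-1}]^{-1}\Lambda_n, & \text{in case } r = t_{n-1},\ a = w_n,\ n = 1, 2, \ldots, m \\ [\Lambda_n]^{-1}\lambda^*(t_0, W_n a), & \text{in case } r = t_n,\ a \ne w_{n+1},\ n = 1, 2, \ldots, m \\ [\Lambda_m]^{-1}\tau, & \text{in case } r = t_m,\ a = w_{m+1} \\ \varepsilon, & \text{in case } r = t_n,\ a = w_{n+1},\ n = m+1, \ldots, k-1 \\ \text{not defined} & \text{otherwise} \end{cases}$$

$$\Psi'(r) = \begin{cases} \Psi(r), & \text{in case } r \notin \{t_1, t_2, \ldots, t_k\} \ \&\ r \in F \\ \{\varepsilon\}, & \text{in case } r = t_k \\ [\Lambda_n]^{-1}\lambda^*(t_0, W_n) \cdot \Psi(r), & \text{in case } r = t_n \in F,\ n = 1, 2, \ldots, m \\ \text{not defined} & \text{otherwise} \end{cases}$$

Then the subsequential transducer $\mathcal{T}'$ is minimal except for $\omega$, and the following
holds: $L(\mathcal{T}') = L(\mathcal{T}) \cup \{\omega\}$, $O_{\mathcal{T}'}|_{L(\mathcal{T})} = O_{\mathcal{T}}$, $O_{\mathcal{T}'}(\omega) = \{\tau\}$.

Theorem 4. Let the subsequential transducer $\mathcal{T} = (\Sigma, \Delta, S, s, F, \mu, \lambda, \Psi)$ be
minimal except for $\omega = w_1 w_2 \ldots w_k$ and $\omega \in L(\mathcal{T})$ be the last word in the
lexicographical order of the input language of the transducer. Let $\tau \in \Delta^*$ be
a new output for $\omega$, such that $\tau \notin O_{\mathcal{T}}(\omega)$. Let us use $W_n$ to denote the
word $W_n = w_1 w_2 \ldots w_n$; $n = 1, 2, \ldots, k$ and $W_0 = \varepsilon$. Let us use $\Lambda_n$ to denote
the word $\Lambda_n = \lambda^*(t_0, W_n) \wedge \tau$. Let us define the subsequential transducer
$\mathcal{T}' = (\Sigma, \Delta, S, s, F, \mu, \lambda', \Psi')$ as follows:

$$\lambda'(r, a) = \begin{cases} \lambda(r, a), & \text{in case } r \in S \setminus \{t_0, \ldots, t_k\} \lor (r = t_0 \ \&\ a \ne w_1) \\ [\Lambda_{n-1}]^{-1}\Lambda_n, & \text{in case } r = t_{n-1},\ a = w_n,\ n = 1, 2, \ldots, k \\ [\Lambda_n]^{-1}\lambda^*(t_0, W_n a), & \text{in case } r = t_n,\ a \ne w_{n+1},\ n = 1, 2, \ldots, k-1 \\ \text{not defined} & \text{otherwise} \end{cases}$$

$$\Psi'(r) = \begin{cases} \Psi(r), & \text{in case } r \notin \{t_1, t_2, \ldots, t_k\} \ \&\ r \in F \\ [\Lambda_n]^{-1}\lambda^*(t_0, W_n) \cdot \Psi(t_n), & \text{in case } r = t_n \in F,\ n = 1, 2, \ldots, k-1 \\ [\Lambda_k]^{-1}\lambda^*(t_0, W_k) \cdot \Psi(t_k) \cup \{[\Lambda_k]^{-1}\tau\}, & \text{in case } r = t_k \\ \text{not defined} & \text{otherwise} \end{cases}$$

Then the subsequential transducer $\mathcal{T}'$ is minimal except for $\omega$, and the following
holds: $L(\mathcal{T}') = L(\mathcal{T})$, $O_{\mathcal{T}'}|_{L(\mathcal{T}) \setminus \{\omega\}} = O_{\mathcal{T}}|_{L(\mathcal{T}) \setminus \{\omega\}}$ and $O_{\mathcal{T}'}(\omega) = O_{\mathcal{T}}(\omega) \cup \{\tau\}$.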



We can use the proof schema introduced in ^ to prove the lemmas and
theorems for subsequential transducers that are minimal except for a word. The
only difference is that we have to check that the resulting transducers are canonical.

We can use the following equations for an efficient computation of the
functions $\lambda'$ and $\Psi'$ for the last two theorems.

$c_1 = \lambda(t_0, w_1) \wedge \tau$, $l_1 = c_1^{-1}\lambda(t_0, w_1)$, $\tau_1 = c_1^{-1}\tau$

$c_2 = (l_1\lambda(t_1, w_2)) \wedge \tau_1$, $l_2 = c_2^{-1}(l_1\lambda(t_1, w_2))$, $\tau_2 = c_2^{-1}\tau_1$

$c_3 = (l_2\lambda(t_2, w_3)) \wedge \tau_2$, $l_3 = c_3^{-1}(l_2\lambda(t_2, w_3))$, $\tau_3 = c_3^{-1}\tau_2$

...

$c_m = (l_{m-1}\lambda(t_{m-1}, w_m)) \wedge \tau_{m-1}$, $l_m = c_m^{-1}(l_{m-1}\lambda(t_{m-1}, w_m))$, $\tau_m = c_m^{-1}\tau_{m-1}$

We can calculate $c_n$, $l_n$, $\tau_n$ iteratively for $n = 1, 2, \ldots, m$.

We can prove by induction that:

$c_n = [\lambda^*(t_0, W_{n-1}) \wedge \tau]^{-1}(\lambda^*(t_0, W_n) \wedge \tau)$

$l_n = [\lambda^*(t_0, W_n) \wedge \tau]^{-1}\lambda^*(t_0, W_n)$

$\tau_n = [\lambda^*(t_0, W_n) \wedge \tau]^{-1}\tau$

for $n = 1, 2, \ldots, m$. Hence we have that:

$\lambda'(t_{n-1}, w_n) = c_n$

$\lambda'(t_n, a) = l_n\lambda(t_n, a)$

$\Psi'(t_n) = l_n \cdot \Psi(t_n)$

for $a \ne w_{n+1}$, $n = 1, 2, \ldots, m$, and

$\lambda'(t_m, w_{m+1}) = \tau_m$ for Theorem 3 or

$\Psi'(t_k) = l_k \cdot \Psi(t_k) \cup \{\tau_k\}$ for Theorem 4.
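
The iterative computation of $c_n$, $l_n$ and $\tau_n$ can be sketched as a single pass over the outputs found on the common-prefix path. The function name `push_outputs` and the list-based input are assumptions made for illustration; the values in the example correspond to adding "jun" with output "30" to a transducer whose path for "ju" carries the outputs "31" and the empty string.

```python
def lcp(u, v):
    """Longest common prefix of two words."""
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return u[:n]


def push_outputs(path_outputs, tau):
    """Compute the triples (c_n, l_n, tau_n) for n = 1..m.

    path_outputs[n-1] is the current output on the n-th transition of the
    common-prefix path; tau is the output of the word being added.
    """
    triples = []
    l_prev = ""
    for out in path_outputs:
        carried = l_prev + out          # output pushed from the previous step
        c = lcp(carried, tau)           # part that stays on this transition
        l = carried[len(c):]            # part pushed further down
        tau = tau[len(c):]              # remaining output of the new word
        triples.append((c, l, tau))
        l_prev = l
    return triples


print(push_outputs(["31", ""], "30"))
```

Each $c_n$ becomes the new output of the $n$-th path transition, each $l_n$ is prepended to the other outgoing outputs and final outputs of $t_n$, and the last $\tau_n$ is what remains to be placed on the new branch or in the final function.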

Now we can proceed with the description of our method for direct building 
of minimal subsequential transducer for a given sorted list of words. 

Let a non-empty finite list of words $L$ in lexicographical order be given, and let
the corresponding output be given for every word in $L$. Let $\omega^{(i)}$ denote the $i$-th
word of the list and $\tau^{(i)}$ denote the output of the $i$-th word. We start with the
minimal canonical subsequential transducer which recognizes only the first word
of the list and outputs the output for the first word. This transducer can be built
trivially and is also minimal except for $\omega^{(1)}$. Using it as a basis we carry out an
induction on the words of the list. Let us assume that the transducer $\mathcal{T}^{(n)}$ with
language $L^{(n)} = \{\omega^{(i)} \mid i = 1, 2, \ldots, n\}$ has been built and that $\mathcal{T}^{(n)}$ is minimal
except for $\omega^{(n)}$ and $O_{\mathcal{T}^{(n)}}(\omega^{(i)}) = \tau^{(i)}$ for $i = 1, 2, \ldots, n$. We have to build the
transducer $\mathcal{T}^{(n+1)}$ with language $L^{(n+1)} = \{\omega^{(i)} \mid i = 1, 2, \ldots, n+1\}$ which is
minimal except for $\omega^{(n+1)}$ and $O_{\mathcal{T}^{(n+1)}}(\omega^{(i)}) = \tau^{(i)}$ for $i = 1, 2, \ldots, n+1$.

Let $\omega'$ be the longest common prefix of the words $\omega^{(n)}$ and $\omega^{(n+1)}$. Using
Lemma 1 and Lemma 2 several times (corresponding to the actual case) we build
the transducer $\mathcal{T}'$ which is equivalent to $\mathcal{T}^{(n)}$ and is minimal except for $\omega'$. Now
we can use Theorem 3 (or Theorem 4 if $\omega^{(n)} = \omega^{(n+1)}$) to build the transducer
$\mathcal{T}^{(n+1)}$ with language $L^{(n+1)} = L^{(n)} \cup \{\omega^{(n+1)}\} = \{\omega^{(i)} \mid i = 1, 2, \ldots, n+1\}$ which
is minimal except for $\omega^{(n+1)}$ and $O_{\mathcal{T}^{(n+1)}}(\omega^{(i)}) = \tau^{(i)}$ for $i = 1, 2, \ldots, n+1$.

In this way by induction we build the transducer that is minimal except for the
last word of the list, with language the list $L$ and the given output. At the end, using
Lemma 1 and Lemma 2 again, we build the transducer equivalent to the former
one which is minimal except for the empty word. From Proposition 1 we have
that it is the minimal subsequential transducer for the list $L$ and corresponding
output.

To distinguish efficiently between Lemma 1 and Lemma 2 we can use the
condition given in Lemma 3.

Example 2. Let us consider the following example. In Figure 1 the transducer
minimal except for jul, with input language {apr, aug, dec, feb, jan, jul} and output
function O(apr) = {30}; O(aug) = {31}; O(dec) = {31}; O(feb) = {28, 29};
O(jan) = {31}; O(jul) = {31}, is given. After the application of Lemma 2
and Theorem 3 we construct the transducer minimal except for jun, where
O(jun) = {30}. This transducer is given in Figure 2. In this way we add
the next word with the corresponding output to the transducer.



Direct Construction of Minimal Acyclic Subsequential Transducers




Fig. 2. Subsequential transducer minimal except for jun. 



3 Algorithm for Building of Minimal Subsequential 
Transducer for a Given Sorted List 

Here we give the pseudo-code in a Pascal-like language (like the language used 
in P). We will presume that there are given implementations for Abstract Data 
Types (ADT) representing transducer state and dictionary of transducer states. 
Later we presume that NULL is the null constant for arbitrary abstract data 
type. 

On Transducer state we will need the following types and operations: 

1. STATE is pointer to a structure representing a transducer state; 

2. FIRST_CHAR, LAST_CHAR : are the first and the last char in the input 
alphabet; 

3. function NEW_STATE : STATE returns a new state; 

4. function FINAL(STATE) : boolean returns true if the state is final and 
false otherwise; 

5. procedure SET_FINAL(STATE, boolean) sets the finality of the state to
the boolean parameter;

6. function TRANSITION(STATE, char) : STATE returns the state to which 
the transducer transits from the parameter state with the parameter char; 

7. procedure SET_TRANSITION(STATE, char, STATE) sets the transition
from the first parameter state by the parameter char to the second
parameter state;

8. function STATE_OUTPUT(STATE) : set of string returns the output set
of strings on final states;

9. procedure SET_STATE_OUTPUT(STATE, set of string) sets the output 
set of strings on final states; 

10. function OUTPUT(STATE, char) : string returns the output string for the 
transition from the parameter state by the parameter char; 

11. procedure SET_OUTPUT(STATE, char, string) sets the output string for 
the transition from the parameter state by the parameter char; 

12. procedure PRINT_TRANSDUCER(file, STATE) prints the transducer 
starting from the parameter state to file. 

Having defined the above operations we make use of the following three 
functions and procedures: 

1. function COPY_STATE(STATE) : STATE copies a state to a new one; 

2. procedure CLEAR_STATE(STATE) clears all transitions of the state and 
sets it to non final; 

3. function COMPARE_STATES(STATE, STATE) : integer compares two 
states 

The ADT on Dictionary of transducer states uses the COMPARE_STATES 
function above to compare states. For the dictionary we need the following op- 
erations: 

1. function NEW_DICTIONARY : DICTIONARY returns a new empty
dictionary;

2. function MEMBER(DICTIONARY, STATE) : STATE returns state in the 
dictionary equivalent to the parameter state or NULL if not present; 

3. procedure INSERT(DICTIONARY, STATE) inserts state to dictionary. 

Implementation for the above ADTs could be found in e.g. PJ. Now we are 
ready to present the pseudo-code of our algorithm. 

Algorithm 5. For direct building of the minimal subsequential transducer
representing the input list of words given in lexicographical order with their
corresponding outputs.

program Create_Minimal_Transducer_for_Given_List (input, output);
var
  MinimalTransducerStatesDictionary : DICTIONARY;
  TempStates : array [0..MAX_WORD_SIZE] of STATE;
  InitialState : STATE;
  PreviousWord, CurrentWord, CurrentOutput,
  WordSuffix, CommonPrefix : string;
  tempString : string;
  tempSet : set of string;
  i, j, PrefixLengthPlus1 : integer;
  c : char;

function FindMinimized (s : STATE) : STATE;
{returns an equivalent state from the dictionary. If not present,
 inserts a copy of the parameter into the dictionary and returns it.}
var r : STATE;
begin
  r := MEMBER(MinimalTransducerStatesDictionary, s);
  if r = NULL then begin
    r := COPY_STATE(s);
    INSERT(MinimalTransducerStatesDictionary, r);
  end;
  return(r);
end; {FindMinimized}

begin
  MinimalTransducerStatesDictionary := NEW_DICTIONARY;
  for i := 0 to MAX_WORD_SIZE do
    TempStates[i] := NEW_STATE;
  PreviousWord := '';
  CLEAR_STATE(TempStates[0]);
  while not eof(input) do begin
    {loop for the words in the input list}
    readln(input, CurrentWord, CurrentOutput);
    {the following loop calculates the length of the longest common
     prefix of CurrentWord and PreviousWord, plus one}
    i := 1;
    while (i <= length(CurrentWord)) and (i <= length(PreviousWord))
          and (PreviousWord[i] = CurrentWord[i]) do
      i := i + 1;
    PrefixLengthPlus1 := i;
    {we minimize the states from the suffix of the previous word}
    for i := length(PreviousWord) downto PrefixLengthPlus1 do
      SET_TRANSITION(TempStates[i-1], PreviousWord[i],
                     FindMinimized(TempStates[i]));
    {this loop initializes the tail states for the current word}
    for i := PrefixLengthPlus1 to length(CurrentWord) do begin
      CLEAR_STATE(TempStates[i]);
      SET_TRANSITION(TempStates[i-1], CurrentWord[i], TempStates[i]);
    end;
    if CurrentWord <> PreviousWord then begin
      SET_FINAL(TempStates[length(CurrentWord)], true);
      SET_STATE_OUTPUT(TempStates[length(CurrentWord)], {''});
    end;
    {push the outputs along the common prefix}
    for j := 1 to PrefixLengthPlus1 - 1 do begin
      CommonPrefix := OUTPUT(TempStates[j-1], CurrentWord[j]) ∧ CurrentOutput;
      WordSuffix := CommonPrefix^{-1} OUTPUT(TempStates[j-1], CurrentWord[j]);
      SET_OUTPUT(TempStates[j-1], CurrentWord[j], CommonPrefix);
      for c := FIRST_CHAR to LAST_CHAR do begin
        if TRANSITION(TempStates[j], c) <> NULL then
          SET_OUTPUT(TempStates[j], c,
                     concat(WordSuffix, OUTPUT(TempStates[j], c)));
      end;
      if FINAL(TempStates[j]) then begin
        tempSet := ∅;
        for tempString in STATE_OUTPUT(TempStates[j]) do
          tempSet := tempSet ∪ {concat(WordSuffix, tempString)};
        SET_STATE_OUTPUT(TempStates[j], tempSet);
      end;
      CurrentOutput := CommonPrefix^{-1} CurrentOutput;
    end;
    if CurrentWord = PreviousWord then
      SET_STATE_OUTPUT(TempStates[length(CurrentWord)],
        STATE_OUTPUT(TempStates[length(CurrentWord)]) ∪ {CurrentOutput})
    else
      SET_OUTPUT(TempStates[PrefixLengthPlus1-1],
                 CurrentWord[PrefixLengthPlus1], CurrentOutput);
    PreviousWord := CurrentWord;
  end; {while}
  {here we are minimizing the states of the last word}
  for i := length(CurrentWord) downto 1 do
    SET_TRANSITION(TempStates[i-1], PreviousWord[i],
                   FindMinimized(TempStates[i]));
  InitialState := FindMinimized(TempStates[0]);
  PRINT_TRANSDUCER(output, InitialState);
end.
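
For readers who want to experiment, the pseudo-code above can be transcribed into a short runnable sketch. This is not the authors' implementation: the class `State`, the functions `build_minimal_transducer` and `lookup`, and the use of a Python dictionary keyed by a state signature in place of the DICTIONARY abstract data type are all assumptions made for illustration. The sketch assumes a non-empty input list of non-empty words in sorted order, with repeated words allowed when they carry additional outputs.

```python
class State:
    """A transducer state: finality, final outputs, labelled transitions."""
    def __init__(self):
        self.clear()

    def clear(self):
        self.final = False
        self.state_output = set()        # final outputs, used when final
        self.trans = {}                  # letter -> [output string, target]

    def signature(self):
        # Valid as an equivalence key once all targets are minimized states.
        edges = tuple(sorted((c, e[0], id(e[1])) for c, e in self.trans.items()))
        return (self.final, frozenset(self.state_output), edges)

    def copy(self):
        s = State()
        s.final = self.final
        s.state_output = set(self.state_output)
        s.trans = {c: [e[0], e[1]] for c, e in self.trans.items()}
        return s


def common_prefix(u, v):
    n = 0
    while n < len(u) and n < len(v) and u[n] == v[n]:
        n += 1
    return u[:n]


def build_minimal_transducer(entries):
    """entries: sorted list of (word, output) pairs; returns the start state."""
    registry = {}                        # signature -> minimized state
    temp = [State() for _ in range(max(len(w) for w, _ in entries) + 1)]

    def find_minimized(s):
        return registry.setdefault(s.signature(), s.copy())

    prev = ""
    for word, out in entries:
        p = len(common_prefix(prev, word))
        for i in range(len(prev), p, -1):            # minimize old suffix
            temp[i - 1].trans[prev[i - 1]][1] = find_minimized(temp[i])
        for i in range(p + 1, len(word) + 1):        # create new tail
            temp[i].clear()
            temp[i - 1].trans[word[i - 1]] = ["", temp[i]]
        if word != prev:
            temp[len(word)].final = True
            temp[len(word)].state_output = {""}
        for j in range(1, p + 1):                    # push outputs forward
            edge = temp[j - 1].trans[word[j - 1]]
            cp = common_prefix(edge[0], out)
            suffix = edge[0][len(cp):]
            edge[0] = cp
            for e in temp[j].trans.values():
                e[0] = suffix + e[0]
            if temp[j].final:
                temp[j].state_output = {suffix + s for s in temp[j].state_output}
            out = out[len(cp):]
        if word == prev:
            temp[len(word)].state_output.add(out)
        else:
            temp[p].trans[word[p]][0] = out
        prev = word
    for i in range(len(prev), 0, -1):                # minimize last word
        temp[i - 1].trans[prev[i - 1]][1] = find_minimized(temp[i])
    return find_minimized(temp[0])


def lookup(start, word):
    """Return the output set of `word`, or None if it is not accepted."""
    state, prefix = start, ""
    for ch in word:
        if ch not in state.trans:
            return None
        out, state = state.trans[ch]
        prefix += out
    return {prefix + s for s in state.state_output} if state.final else None


months = [("apr", "30"), ("aug", "31"), ("dec", "31"), ("feb", "28"),
          ("feb", "29"), ("jan", "31"), ("jul", "31"), ("jun", "30")]
start = build_minimal_transducer(months)
print(lookup(start, "feb"), lookup(start, "jun"))
```

The signature uses the identity of the target states, which is sound here only because every target of a state being registered has itself already been replaced by its registered representative.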



4 Implementation Results and Comparisons 

Based on the main algorithm for direct building of minimal automata we have
created implementations for the direct construction of the minimal automaton with
labeled final states and of the minimal subsequential transducer. The results are
summarized in the table below. We used a Bulgarian grammatical dictionary of simple
words with about 900000 entries for the experiments. An implementation of the
algorithm given in [3] has been used for the construction of the pseudo-minimal
subsequential transducer.

Mehryar Mohri reports that the construction with his method of the
p-subsequential transducer for a French dictionary of 672000 entries takes 20'
on an HP/9000 755 computer. All our experiments have been performed on a 500MHz
Pentium III personal computer with 128MB RAM.

Table 1. Comparison between different automata for the representation of a large-scale
grammatical dictionary for Bulgarian.

Number of lines:  895453
Initial size:     27.5 MB

                     Minimal       Pseudo-minimal   Minimal automaton with
                     transducer    transducer       labeled final states
States               43413         531397           47854
Transitions          106809        992412           110791
Codes                16378         16378            6016
p                    5             -                -
Size of codes        209K          -                126K
Size of automaton    1.3M          -                800K
Construction time    2'35"         -                25"
Memory used          5M            108M             2.5M
Abstract. We present a new generic ε-removal algorithm for weighted
automata and transducers defined over a semiring. The algorithm can
be used with any semiring covered by our framework and works with
any queue discipline adopted. It can be used in particular in the case
of unweighted automata and transducers and weighted automata and
transducers defined over the tropical semiring. It is based on a general
shortest-distance algorithm that we briefly describe. We give a full
description of the algorithm including its pseudocode and its running time
complexity, discuss the more efficient case of acyclic automata, an
on-the-fly implementation of the algorithm and an approximation algorithm
in the case of the semirings not covered by our framework. We also
illustrate the use of the algorithm with several semirings.



1 Introduction 

Weighted automata are efficient and convenient devices used in many
applications such as text, speech and image processing. The automata obtained
in such applications are often the result of various complex operations, some of
them introducing the empty string ε. For example, using the classical method
of Thompson, one can construct in linear time and space a non-deterministic
automaton with ε-transitions representing a regular expression.

For the most efficient use of an automaton, it is preferable to remove its ε's,
since in general they induce a delay in its use. An algorithm that
constructs an automaton B with no ε's equivalent to an input automaton A with
ε's is called ε-removal.

Textbooks do not present ε-removal of (unweighted) automata as an
independent algorithm deserving a specific study. Instead, the algorithm is often
mixed with other optimization algorithms such as determinization. This
usually makes the presentation of determinization more complex and the underlying
ε-removal process obscure. Since ε-removal is not presented as an independent
algorithm, it is usually not analyzed and its running time complexity is not
clearly determined.

We present a new generic ε-removal algorithm for weighted automata and
transducers defined over a semiring. The algorithm can be used with any semiring
covered by our framework and works with any queue discipline adopted. It can
be used in particular in the case of unweighted automata and transducers and
weighted automata and transducers defined over the tropical semiring. It is based
on a general shortest-distance algorithm that we briefly describe. We give a
full description of the algorithm including its pseudocode and its running time
complexity, discuss the more efficient case of acyclic automata, an on-the-fly
implementation of the algorithm and an approximation algorithm in the case of
the semirings not covered by our framework. We also illustrate the use of the
algorithm with several semirings.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, p. 230 ff., 2001.

2 Preliminaries 

Weighted automata are automata in which the transitions are labeled with
weights in addition to the usual alphabet symbols. For various operations to
be well-defined, the weight set needs to have the algebraic structure of a
semiring.

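Definition 1. A system $(\mathbb{K}, \oplus, \otimes, \overline{0}, \overline{1})$ is a right semiring if:

1. $(\mathbb{K}, \oplus, \overline{0})$ is a commutative monoid with $\overline{0}$ as the identity element for $\oplus$,

2. $(\mathbb{K}, \otimes, \overline{1})$ is a monoid with $\overline{1}$ as the identity element for $\otimes$,

3. $\otimes$ right distributes over $\oplus$: $\forall a, b, c \in \mathbb{K},\ (a \oplus b) \otimes c = (a \otimes c) \oplus (b \otimes c)$,

4. $\overline{0}$ is an annihilator for $\otimes$: $\forall a \in \mathbb{K},\ a \otimes \overline{0} = \overline{0} \otimes a = \overline{0}$.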

Left semirings are defined in a similar way by replacing right distributivity
with left distributivity. $(\mathbb{K}, \oplus, \otimes, \overline{0}, \overline{1})$ is a semiring if both left and
right distributivity hold. Thus, more informally, a semiring is a ring that may
lack negation. As an example, $(\mathbb{N}, +, \cdot, 0, 1)$ is a semiring defined on the set of
nonnegative integers $\mathbb{N}$.

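A semiring $(\mathbb{K}, \oplus, \otimes, \overline{0}, \overline{1})$ is said to be idempotent if for any $a \in \mathbb{K}$, $a \oplus a = a$.
The boolean semiring $\mathbb{B} = (\{0, 1\}, \lor, \land, 0, 1)$ and the tropical semiring
$\mathbb{T} = (\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$ are idempotent, but $(\mathbb{N}, +, \cdot, 0, 1)$ is not.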
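The three semirings named above can be written down concretely. The sketch below is only an illustration: the `Semiring` record and the `is_idempotent_on` helper are hypothetical names, and the check runs on a finite sample, so it is a spot check rather than a proof.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Semiring:
    plus: Callable[[Any, Any], Any]    # the ⊕ operation
    times: Callable[[Any, Any], Any]   # the ⊗ operation
    zero: Any                          # identity for ⊕, annihilator for ⊗
    one: Any                           # identity for ⊗


BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
TROPICAL = Semiring(min, lambda a, b: a + b, math.inf, 0.0)
NATURAL = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)


def is_idempotent_on(sr: Semiring, samples) -> bool:
    """Check a ⊕ a = a on a finite sample of elements."""
    return all(sr.plus(a, a) == a for a in samples)


print(is_idempotent_on(BOOLEAN, [False, True]))           # True
print(is_idempotent_on(TROPICAL, [0.0, 1.5, math.inf]))   # True
print(is_idempotent_on(NATURAL, [0, 1, 2]))               # False, since 1 + 1 = 2
```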

Definition 2. A weighted automaton $A = (\Sigma, Q, I, F, E, \lambda, \rho)$ over the semiring
$\mathbb{K}$ is a 7-tuple where $\Sigma$ is the finite alphabet of the automaton, $Q$ is a finite
set of states, $I \subseteq Q$ the set of initial states, $F \subseteq Q$ the set of final states,
$E \subseteq Q \times \Sigma \times \mathbb{K} \times Q$ a finite set of transitions, $\lambda : I \to \mathbb{K}$ the initial weight
function mapping $I$ to $\mathbb{K}$, and $\rho : F \to \mathbb{K}$ the final weight function mapping $F$
to $\mathbb{K}$.

Given a transition $e \in E$, we denote by $i[e]$ its input label, $w[e]$ its weight,
$p[e]$ its origin or previous state and $n[e]$ its destination state or next state.
Given a state $q \in Q$, we denote by $E[q]$ the set of transitions leaving $q$, and by
$E^R[q]$ the set of transitions entering $q$.

A path $\pi = e_1 \cdots e_k$ in $A$ is an element of $E^*$ with consecutive transitions:
$n[e_{i-1}] = p[e_i]$, $i = 2, \ldots, k$. We extend $n$ and $p$ to paths by setting: $n[\pi] = n[e_k]$
and $p[\pi] = p[e_1]$. We denote by $P(q, q')$ the set of paths from $q$ to $q'$. $P$ can be
extended to subsets $R \subseteq Q$, $R' \subseteq Q$, by:

$$P(R, R') = \bigcup_{q \in R,\ q' \in R'} P(q, q')$$

The labeling function $i$ and the weight function $w$ can also be extended to
paths by defining the label of a path as the concatenation of the labels of its
constituent transitions, and the weight of a path as the $\otimes$-product of the weights
of its constituent transitions:

$$i[\pi] = i[e_1] \cdots i[e_k]$$
$$w[\pi] = w[e_1] \otimes \cdots \otimes w[e_k]$$

Given a string $x \in \Sigma^*$, we denote by $P(x)$ the set of paths from $I$ to $F$ labeled
with $x$:

$$P(x) = \{\pi \in P(I, F) : i[\pi] = x\}$$

The output weight associated by $A$ to an input string $x \in \Sigma^*$ is:

$$A \cdot x = \bigoplus_{\pi \in P(x)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi])$$

If $P(x) = \emptyset$, $A \cdot x$ is defined to be $\overline{0}$. Note that weighted automata over the
boolean semiring are equivalent to the classical unweighted finite automata.
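To make the output weight concrete, here is a small sketch over the tropical semiring. The automaton, its weights and the function name `output_weight` are invented for illustration. Instead of enumerating paths, the code propagates a forward weight per state, which gives the same ⊕-sum by distributivity; it assumes an automaton without ε-transitions.

```python
import math

# Tropical semiring: ⊕ = min, ⊗ = +, zero = inf, one = 0.
ZERO = math.inf
plus, times = min, lambda a, b: a + b

# Hypothetical automaton: transitions are (origin, label, weight, destination).
E = [(0, "a", 1.0, 1), (0, "a", 3.0, 2), (1, "b", 4.0, 3), (2, "b", 0.5, 3)]
initial = {0: 0.0}   # initial weight function λ
final = {3: 2.0}     # final weight function ρ


def output_weight(x: str) -> float:
    """⊕ over all successful paths π labeled x of λ(p[π]) ⊗ w[π] ⊗ ρ(n[π])."""
    forward = dict(initial)   # state -> ⊕-sum of λ ⊗ w over paths read so far
    for symbol in x:
        nxt = {}
        for (p, a, w, q) in E:
            if a == symbol and p in forward:
                nxt[q] = plus(nxt.get(q, ZERO), times(forward[p], w))
        forward = nxt
    total = ZERO
    for q, w in forward.items():
        if q in final:
            total = plus(total, times(w, final[q]))
    return total


print(output_weight("ab"))   # min(0 + 1 + 4 + 2, 0 + 3 + 0.5 + 2) = 5.5
print(output_weight("ba"))   # no successful path, so the result is inf
```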

These definitions can be easily generalized to cover the case of weighted
automata with ε-transitions. An ε-removal algorithm computes for any input
weighted automaton $A$ with ε-transitions an equivalent weighted automaton $B$
with no ε-transition, that is such that:

$$\forall x \in \Sigma^*,\quad A \cdot x = B \cdot x$$

The following definitions will help us define the framework for our generic
ε-removal algorithm [7].

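Definition 3. Let $k \geq 0$ be an integer. A commutative semiring $(\mathbb{K}, \oplus, \otimes, \overline{0}, \overline{1})$
is k-closed if:

$$\forall a \in \mathbb{K},\quad \bigoplus_{n=0}^{k+1} a^n = \bigoplus_{n=0}^{k} a^n$$

When $k = 0$, the previous expression can be rewritten as:

$$\forall a \in \mathbb{K},\quad \overline{1} \oplus a = \overline{1}$$

and $\mathbb{K}$ is then said to be bounded. Semirings such as the boolean semiring
$\mathbb{B} = (\{0, 1\}, \lor, \land, 0, 1)$ and the tropical semiring $\mathbb{T} = (\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$ are
bounded.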
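As a quick worked check of the bounded case (an illustration added here, using only the definitions above), the condition $\overline{1} \oplus a = \overline{1}$ can be verified directly in both semirings:

```latex
\begin{align*}
\text{Tropical: } & \overline{1} \oplus a = \min(0, a) = 0 = \overline{1}
    && \text{for all } a \in \mathbb{R}_+ \cup \{\infty\}, \\
\text{Boolean: }  & \overline{1} \oplus a = 1 \lor a = 1 = \overline{1}
    && \text{for all } a \in \{0, 1\}.
\end{align*}
```

By contrast, in $(\mathbb{N}, +, \cdot, 0, 1)$ we have $1 + a \neq 1$ for every $a \geq 1$, so that semiring is not bounded.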
Definition 4. Let $k \geq 0$ be an integer, $(\mathbb{K}, \oplus, \otimes, \overline{0}, \overline{1})$ a commutative semiring,
and $A$ a weighted automaton over $\mathbb{K}$. $(\mathbb{K}, \oplus, \otimes, \overline{0}, \overline{1})$ is right k-closed for $A$ if for
any cycle $\pi$ of $A$:

$$\bigoplus_{n=0}^{k+1} w[\pi]^n = \bigoplus_{n=0}^{k} w[\pi]^n$$

By definition, if $\mathbb{K}$ is k-closed, then it is k-closed for any automaton over $\mathbb{K}$.



3 Algorithm 

Let $A = (\Sigma, Q, I, F, E, \lambda, \rho)$ be a weighted automaton over the semiring $\mathbb{K}$ with
ε-transitions. We denote by $A_\varepsilon$ the automaton obtained from $A$ by removing
all transitions not labeled with ε. We present a general ε-removal algorithm
for weighted automata based on a generic shortest-distance algorithm which
works for any semiring $\mathbb{K}$ k-closed for $A_\varepsilon$. In what follows, we assume that
$\mathbb{K}$ has this property. This class of semirings includes in particular the boolean
semiring $(\{0, 1\}, \lor, \land, 0, 1)$, the tropical semiring $(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$ and
other semirings not necessarily idempotent.¹

We present the algorithm in the case of weighted automata. The case of
weighted transducers can be straightforwardly derived from the automata case
by viewing a transducer as an automaton over an alphabet of pairs of an input
label and an output label, each of which is either a symbol or ε.
An ε-transition of a transducer is then a transition labeled with (ε, ε).²

For $p, q$ in $Q$, the ε-distance from $p$ to $q$ in the automaton $A$ is denoted by
$d[p, q]$ and defined as:

$$d[p, q] = \bigoplus_{\pi \in P(p, q),\ i[\pi] = \varepsilon} w[\pi]$$

This distance is well-defined for any pair of states $(p, q)$ of $A$ when $\mathbb{K}$ is a semiring
k-closed for $A_\varepsilon$. By definition, for any $p \in Q$, $d[p, p] = \overline{1}$. $d[p, q]$ is the distance
from $p$ to $q$ in $A_\varepsilon$.



3.1 Description and Proof 

The algorithm works in two steps. The first step consists of computing for each
state $p$ of the input automaton $A$ its ε-closure, denoted by $C[p]$:

$$C[p] = \{(q, w) : q \in \varepsilon[p],\ d[p, q] = w \in \mathbb{K} - \{\overline{0}\}\}$$

where $\varepsilon[p]$ represents the set of states reachable from $p$ via a path labeled with
ε. We describe this step in more detail later.

¹ The class of semirings with which our generic shortest-distance algorithm works can
in fact be extended to that of right semirings right k-closed for $A_\varepsilon$, but for reasons
of space we do not describe this more general framework here.

² We have also developed an algorithm for removing only input (or output) ε's of a
weighted transducer without any modification to the alphabet of the transducer.
That algorithm is also based on the generic shortest-distance algorithm presented
in the next sections. It applies to all transducers that admit an equivalent input (or
output) ε-free transducer. The complexity of the algorithm is exponential since in
the worst case the size of the output transducer can be exponential in the input size.

The second step consists of modifying the outgoing transitions of each state
$p$ by removing those labeled with ε and by adding to $E[p]$ non-ε-transitions
leaving each state $q \in \varepsilon[p]$ with their weights pre-$\otimes$-multiplied by $d[p, q]$. The
following is the pseudocode of the second step of the algorithm.

ε-removal(A)
1  for each p ∈ Q
2     do E[p] ← {e ∈ E[p] : i[e] ≠ ε}
3        for each (q, w) ∈ C[p]
4           do E[p] ← E[p] ∪ {(p, a, w ⊗ w', r) : (q, a, w', r) ∈ E[q], a ≠ ε}
5              if q ∈ F
6                 then if p ∉ F
7                         then F ← F ∪ {p}
8                      ρ[p] ← ρ[p] ⊕ (w ⊗ ρ[q])

State $p$ is a final state if some state $q \in \varepsilon[p]$ is final and the final weight $\rho[p]$ is
then:

$$\rho[p] = \bigoplus_{q \in \varepsilon[p] \cap F} (d[p, q] \otimes \rho[q])$$
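The second step can be sketched in Python as follows, here over the tropical semiring. This is an illustrative reading of the pseudocode, not the author's implementation: the data layout, the use of `None` for ε, and the name `remove_epsilons` are assumptions. The sketch also assumes that `closure[p]` lists only states q different from p, and it reads weights from the unmodified input so that each final weight follows the formula above.

```python
import math

EPS = None                               # label standing for ε (assumption of this sketch)
plus, times = min, lambda a, b: a + b    # tropical ⊕ and ⊗


def remove_epsilons(states, E, final, closure):
    """E maps p -> list of (label, weight, dest); final maps state -> ρ.

    closure[p] lists (q, d[p, q]) for q ≠ p reachable from p by ε-paths.
    Returns new transition and final-weight maps; the inputs are not modified.
    """
    new_E, new_final = {}, dict(final)
    for p in states:
        # Line 2: keep only the non-ε transitions of p.
        new_E[p] = [(a, w, r) for (a, w, r) in E.get(p, []) if a is not EPS]
        for q, d in closure.get(p, []):
            # Line 4: copy the non-ε transitions of q, pre-⊗-multiplied by d[p, q].
            new_E[p] += [(a, times(d, w), r) for (a, w, r) in E.get(q, []) if a is not EPS]
            # Lines 5-8: p becomes final if q is final.
            if q in final:
                new_final[p] = plus(new_final.get(p, math.inf), times(d, final[q]))
    return new_E, new_final


# Hypothetical automaton: 0 -ε/1-> 1, 0 -a/2-> 2, 1 -b/3-> 2; states 1 and 2 are final.
E = {0: [(EPS, 1.0, 1), ("a", 2.0, 2)], 1: [("b", 3.0, 2)]}
final = {1: 5.0, 2: 0.0}
new_E, new_final = remove_epsilons([0, 1, 2], E, final, {0: [(1, 1.0)]})
print(new_E[0])       # [('a', 2.0, 2), ('b', 4.0, 2)]
print(new_final[0])   # 6.0
```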



Theorem 1. Let $A = (\Sigma, Q, I, F, E, \lambda, \rho)$ be a weighted automaton over the
semiring $\mathbb{K}$ right k-closed for $A$. Then the weighted automaton $B$ result of the
ε-removal algorithm just described is equivalent to $A$.

Proof. We show that the function defined by the weighted automaton $A$ is not
modified by the application of one step of the loop of the pseudocode above.
Let $A = (\Sigma, Q, I, F, E, \lambda, \rho)$ be the automaton just before the removal of the
ε-transitions leaving state $p \in Q$ (lines 2-8).

Let $x \in \Sigma^*$, and let $Q(p)$ denote the set of successful paths labeled with $x$
passing through $p$ and either ending at $p$ or following an ε-transition of $p$. By
commutativity of $\oplus$, we have:

$$A \cdot x = \bigoplus_{\pi \in P(x) - Q(p)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi]) \ \oplus \bigoplus_{\pi \in Q(p)} \lambda(p[\pi]) \otimes w[\pi] \otimes \rho(n[\pi])$$

Denote by $S_1$ and $S_2$ the first and second term of that sum. The ε-removal at
state $p$ does not affect the paths not in $Q(p)$, thus we can limit our attention to
the second term $S_2$. A path in $Q(p)$ can be factored in: $\pi = \pi_1 \pi_\varepsilon \pi_2$, where $\pi_\varepsilon$ is
a portion of the path from $p$ to $n[\pi_\varepsilon] = q$ labeled with ε. The distributivity of $\otimes$
over $\oplus$ gives us:

$$S_2 = \Big( \bigoplus_{n[\pi_1] = p} \lambda(p[\pi_1]) \otimes w[\pi_1] \Big) \otimes S_{22}$$

with:

$$\begin{aligned}
S_{22} &= \Big( \bigoplus_{p[\pi_\varepsilon] = p,\ n[\pi_\varepsilon] = q} w[\pi_\varepsilon] \Big) \otimes \bigoplus_{p[\pi_2] = q} w[\pi_2] \otimes \rho(n[\pi_2]) \\
       &= d[p, q] \otimes \bigoplus_{p[\pi_2] = q} w[\pi_2] \otimes \rho(n[\pi_2]) \\
       &= \bigoplus_{p[\pi_2] = q} (d[p, q] \otimes w[\pi_2]) \otimes \rho(n[\pi_2])
\end{aligned}$$

For paths $\pi$ such that $\pi_2 = \varepsilon$:

$$S_{22} = \bigoplus_{p[\pi_2] = q,\ q \in \varepsilon[p]} (d[p, q] \otimes \rho(q))$$

which is exactly the final weight associated to $p$ by ε-removal at $p$. Otherwise,
by definition of $\pi_\varepsilon$, $\pi_2$ does not start with an ε-transition. The second term of
the sum defining $S_{22}$ can thus be rewritten as:

$$\bigoplus_{e \in E[q],\ i[e] \neq \varepsilon,\ \pi_2 = e \pi_2'} (d[p, q] \otimes w[e]) \otimes w[\pi_2'] \otimes \rho(n[\pi_2])$$

The ε-removal step at state $p$ follows exactly that decomposition. It adds non
ε-transitions $e$ leaving $q$ with weights $(d[p, q] \otimes w[e])$ to the transitions leaving $p$.
This ends the proof of the theorem. □

After removing ε's at each state $p$, some states may become inaccessible
if they could only be reached by ε-transitions originally. Those states can be
removed from the result in time linear in the size of the resulting machine using
for example a depth-first search of the automaton.
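The cleanup pass mentioned above can be sketched as a plain depth-first search. The data layout (a dictionary from state to outgoing arcs) and the names `accessible_states` and `trim` are assumptions made for this example.

```python
def accessible_states(initial, E):
    """Iterative depth-first search; E maps p -> list of (label, weight, dest)."""
    seen, stack = set(initial), list(initial)
    while stack:
        p = stack.pop()
        for (_label, _weight, r) in E.get(p, []):
            if r not in seen:
                seen.add(r)
                stack.append(r)
    return seen


def trim(initial, E, final):
    """Drop states that cannot be reached from an initial state."""
    keep = accessible_states(initial, E)
    return ({p: arcs for p, arcs in E.items() if p in keep},
            {p: w for p, w in final.items() if p in keep})


# Hypothetical result of ε-removal: state 1 was only reachable through an ε-transition.
E = {0: [("a", 2.0, 2), ("b", 4.0, 2)], 1: [("b", 3.0, 2)], 2: []}
trimmed_E, trimmed_final = trim([0], E, {0: 6.0, 1: 5.0, 2: 0.0})
print(sorted(trimmed_E))   # [0, 2]
```

Each state and each transition is visited at most once, which matches the linear bound stated in the text.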

Figures 1(a)-(c) illustrate the use of the algorithm in the specific case of the
tropical semiring. ε-transitions can be removed in at least two ways: in the way
described above, or the reverse. The latter is equivalent to applying ε-removal to
the reverse of the automaton and re-reversing the result. The two methods may
lead to results of very different sizes as illustrated by the figures. This is due to
two factors that are independent of the semiring:

- the number of states in the original automaton whose incoming transitions
  (or outgoing transitions in the reverse case) are all labeled with the empty
  string. As mentioned before, those states can be removed from the result.
  For example, state 3 of the automaton of figure 1(a) can only be reached
  by ε-transitions and admits only outgoing transitions labeled with ε. Thus,
  that state does not appear in the result of either method (figures 1(b)-(c)).
  The incoming transitions of state 2 are all labeled with ε and thus it does
  not appear in the result of the ε-removal with the first method, but it does
  in the reverse method because the outgoing transitions of state 2 are not all
  labeled with ε;

- the total number of non ε-transitions of the states that can be reached from
  each state $q$ in $A_\varepsilon$ (the reverse of $A_\varepsilon$ in the reverse case). This corresponds
  to the number of outgoing transitions of $q$ in the result of ε-removal.
Fig. 1. ε-removal in the tropical semiring. (a) Weighted automaton A with
ε-transitions. (b) Weighted automaton B equivalent to A, result of the ε-removal
algorithm. (c) Weighted automaton C equivalent to A obtained by application of
ε-removal to the reverse of A.



In practice, one can use some heuristics to reduce the number of states and
transitions of the resulting machine, although this will not affect the worst case
complexity of the algorithm. One can for instance remove some ε-transitions
in the reverse way when that creates fewer transitions and others in the way
corresponding to the first method when that helps reduce the resulting size.

Figures 2(a)-(b) illustrate the algorithm in the case of another semiring, the
semiring of real numbers. Our general algorithm applies in this case since $A_\varepsilon$ is
acyclic.

3.2 Computation of e-Closures 

As noticed before, the computation of ε-closures is equivalent to that of all-pairs
shortest-distances over the semiring $\mathbb{K}$ in $A_\varepsilon$. There exists a generalization of the
algorithm of Floyd-Warshall for computing the all-pairs shortest-distances
over a semiring $\mathbb{K}$ under some general conditions.³ However, the running time
complexity of that algorithm is cubic:

$$O(|Q|^3 (T_\oplus + T_\otimes + T_*))$$

where $T_\oplus$, $T_\otimes$, and $T_*$ denote the cost of $\oplus$, $\otimes$, and closure operations in the
semiring considered. The algorithm can be improved by first decomposing $A_\varepsilon$
into its strongly connected components, and then computing all-pairs shortest
distances in each component visited in reverse topological order. However, it is
still impractical for large automata when $A_\varepsilon$ has large cycles of several thousand
states. The quadratic space complexity $O(|Q|^2)$ of the algorithm also makes it
prohibitive for such large automata. Another problem with this generalization
of the algorithm of Floyd-Warshall is that it does not exploit the sparseness of
the input automaton.

³ That algorithm works in particular with the semiring $(\mathbb{R}, +, *, 0, 1)$ when the weight
of each cycle of $A_\varepsilon$ admits a well-defined closure.

Fig. 2. ε-removal in the real semiring $(\mathbb{R}, +, *, 0, 1)$. (a) Weighted automaton A. $A_\varepsilon$ is
k-closed for $(\mathbb{R}, +, *, 0, 1)$. (b) Weighted automaton B equivalent to A, output of the
ε-removal algorithm.

There exists a generic single-source shortest-distance algorithm that works
with any semiring covered by our framework. The algorithm is a generalization
of the classical shortest-paths algorithms to the case of the semirings of this
framework.

This generalization is not trivial and does not require the semiring to be
idempotent. In particular, a straightforward extension of the classical algorithms
based on a relaxation technique would not produce the correct result in general
[7]. The algorithm is also generic in the sense that it works with any queue
discipline. The following is the pseudocode of the algorithm.

Generic-Single-Source-Shortest-Distance(B, s)
 1  for each p ∈ Q
 2     do d[p] ← r[p] ← 0̄
 3  d[s] ← r[s] ← 1̄
 4  S ← {s}
 5  while S ≠ ∅
 6     do q ← head(S)
 7        Dequeue(S)
 8        r ← r[q]
 9        r[q] ← 0̄
10        for each e ∈ E[q]
11           do if d[n[e]] ≠ d[n[e]] ⊕ (r ⊗ w[e])
12                 then d[n[e]] ← d[n[e]] ⊕ (r ⊗ w[e])
13                      r[n[e]] ← r[n[e]] ⊕ (r ⊗ w[e])
14                      if n[e] ∉ S
15                         then Enqueue(S, n[e])
16  d[s] ← 1̄

Fig. 3. Generic single-source shortest-distance algorithm.

A queue $S$ is used to maintain the set of states whose leaving transitions are
to be relaxed. $S$ is initialized to $\{s\}$ (line 4). For each state $q \in Q$, two attributes
are maintained: $d[q] \in \mathbb{K}$, an estimate of the shortest distance from $s$ to $q$, and
$r[q] \in \mathbb{K}$, the total weight added to $d[q]$ since the last time $q$ was extracted from
$S$. Lines 1-3 initialize arrays $d$ and $r$. After initialization, $d[q] = r[q] = \overline{0}$ for
$q \in Q - \{s\}$, and $d[s] = r[s] = \overline{1}$.

Given a state $q \in Q$ and a transition $e \in E[q]$, a relaxation step on $e$ is
performed by lines 11-13 of the pseudocode, where $r$ is the value of $r[q]$ just
after the latest extraction of $q$ from $S$ if $q$ has ever been extracted from $S$, its
initialization value otherwise.

Each time through the while loop of lines 5-15, a state $q$ is extracted from
$S$ (lines 6-7). The value of $r[q]$ just after extraction of $q$ is stored in $r$, and then
$r[q]$ is set to $\overline{0}$ (lines 8-9). Lines 11-13 relax each transition leaving $q$. If the
tentative shortest distance $d[n[e]]$ is updated during the relaxation and if $n[e]$ is
not already in $S$, the state $n[e]$ is inserted in $S$ so that its leaving transitions are
later relaxed (lines 14-15). $r[n[e]]$ is updated whenever $d[n[e]]$ is. $r[n[e]]$ stores
the total weight $\oplus$-added to $d[n[e]]$ since $n[e]$ was last extracted from $S$ or since
the time after initialization if $n[e]$ has never been extracted from $S$. Finally, line
16 resets the value of $d[s]$ to $\overline{1}$.
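The walkthrough above translates fairly directly into Python. The sketch below is one possible rendering under stated assumptions: it fixes a FIFO queue (only one of the admissible disciplines), passes the semiring as plain functions, and uses a hypothetical name and data layout. It also assumes the semiring is k-closed for the input graph, since otherwise the loop need not terminate.

```python
import math
from collections import deque


def shortest_distance(states, E, s, plus, times, zero, one):
    """Generic single-source shortest distance; E maps q -> list of (weight, dest)."""
    d = {p: zero for p in states}          # lines 1-2
    r = {p: zero for p in states}
    d[s] = r[s] = one                      # line 3
    queue, queued = deque([s]), {s}        # line 4
    while queue:                           # line 5
        q = queue.popleft()                # lines 6-7
        queued.discard(q)
        rq, r[q] = r[q], zero              # lines 8-9
        for (w, n) in E.get(q, []):        # line 10
            candidate = plus(d[n], times(rq, w))
            if d[n] != candidate:          # line 11
                d[n] = candidate           # line 12
                r[n] = plus(r[n], times(rq, w))   # line 13
                if n not in queued:        # lines 14-15
                    queue.append(n)
                    queued.add(n)
    d[s] = one                             # line 16
    return d


# Tropical semiring on a hypothetical graph with a cycle 0 -> 1 -> 0.
E = {0: [(1.0, 1), (4.0, 2)], 1: [(2.0, 2), (1.0, 0)], 2: [(1.0, 3)]}
print(shortest_distance([0, 1, 2, 3], E, 0, min, lambda a, b: a + b, math.inf, 0.0))

# Non-idempotent semiring (N, +, *, 0, 1) on an acyclic graph: d counts paths.
G = {0: [(1, 1), (1, 2)], 1: [(1, 2)]}
print(shortest_distance([0, 1, 2], G, 0, lambda a, b: a + b, lambda a, b: a * b, 0, 1))
```

The second call shows why `r` is kept separately from `d`: state 2 receives one unit of weight directly and one through state 1, and only the weight added since the last extraction is propagated further.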

In the general case, the complexity of the algorithm depends on the semiring
considered and the queue discipline chosen for $S$:

$$O\Big(|Q| + (T_\oplus + T_\otimes + C(A))\,|E| \max_{q \in Q} N(q) + (C(I) + C(E)) \sum_{q \in Q} N(q)\Big)$$

where $N(q)$ denotes the number of times state $q$ is extracted from $S$, $C(E)$ the
worst cost of removing a state $q$ from the queue $S$, $C(I)$ that of inserting $q$ in
$S$, and $C(A)$ the cost of an assignment.⁴

In the case of the tropical semiring $(\mathbb{R}_+ \cup \{\infty\}, \min, +, \infty, 0)$, the algorithm
coincides with classical single-source shortest-paths algorithms. In particular, it
coincides with Bellman-Ford's algorithm when a FIFO queue discipline is used
and with Dijkstra's algorithm when a shortest-first queue discipline is used.
Using Fibonacci heaps, the complexity of Dijkstra's algorithm in the tropical
semiring is:

$$O(|E| + |Q| \log |Q|)$$

The complexity of the algorithm is linear in the case of an acyclic automaton
with a topological order queue discipline:

$$O(|Q| + (T_\oplus + T_\otimes)|E|)$$

Note that the topological order queue discipline can be generalized to the case of
non-acyclic automata by decomposing $A_\varepsilon$ into its strongly connected
components. Any queue discipline can then be used to compute the all-pairs shortest
distances within each strongly connected component.

The all-pairs shortest-distances of $A_\varepsilon$ can be computed by running $|Q|$ times
the generic single-source shortest-distance algorithm. Thus, when $A_\varepsilon$ is acyclic,
that is when $A$ admits no ε-cycle, then the all-pairs shortest distances can be
computed in quadratic time:

$$O(|Q|^2 + (T_\oplus + T_\otimes)|Q| \cdot |E|)$$

When $A_\varepsilon$ is acyclic, the complexity of the computation of the all-pairs shortest
distances can be substantially improved if the states of $A_\varepsilon$ are visited in reverse
topological order and if the shortest-distance algorithm is interleaved with the
actual removal of ε's. Indeed, one can proceed in the following way. For each
state $p$ of $A_\varepsilon$ visited in reverse topological order:

1. run a single-source shortest-distance algorithm with source $p$ to compute the
   distance from $p$ to each state $q$ reachable from $p$ by ε's;

2. remove the ε-transitions leaving $p$ as described in the previous section.

The reverse topological order guarantees that the ε-paths leaving $p$ are reduced
to the ε-transitions leaving $p$. Thus, the cost of the shortest-distance algorithm
run from $p$ only depends on the number of ε-transitions leaving $p$ and the total
cost of the computation of the shortest-distances is linear:

$$O(|Q| + (T_\oplus + T_\otimes)|E|)$$

In the case of the tropical semiring and using Fibonacci heaps, the complexity
of the first stage of the algorithm is:

$$O(|Q| \cdot |E| + |Q|^2 \log |Q|)$$

⁴ This includes the potential cost of reorganization of the queue to perform this
assignment.
In the worst case, in the second stage of the algorithm each state $q$ belongs to
the ε-closure of each state $p$, and the removal of ε's can create in the order of
$|E|$ transitions at each state. Hence, the complexity of the second stage of the
algorithm is:

$$O(|Q|^2 + |Q| \cdot |E|)$$

Thus, the total complexity of the algorithm in the case of an acyclic automaton
$A_\varepsilon$ is:

$$O(|Q|^2 + (T_\oplus + T_\otimes)|Q| \cdot |E|)$$

In the case of the tropical semiring and with a non acyclic automaton $A_\varepsilon$, the
total complexity of the algorithm is:

$$O(|Q| \cdot |E| + |Q|^2 \log |Q|)$$

4 Remarks and Experiments 

We have fully implemented the ε-removal algorithm presented here and
incorporated it in recent versions of the FSM library. For some automata of about
a thousand states with large ε-cycles, our implementation is up to 600 times
faster than the previous implementation based on a generalization of the
Floyd-Warshall algorithm.

An important feature of our algorithm is that it admits a natural on-the-fly
implementation. Indeed, the outgoing transitions of state $q$ of the output
automaton can be computed directly using the ε-closure of $q$. However, with an
on-the-fly implementation, a topological order cannot be used for the queue $S$
even if $A_\varepsilon$ is acyclic since this is not known ahead of time. Thus, we have
implemented both an off-line and an on-the-fly version of the algorithm.

The algorithms presented in textbooks often combine the classical powerset
construction, or determinization, and ε-removal. Each state of the resulting
automaton corresponds to the ε-closure of a subset of states constructed as in
determinization.

If one wishes to apply determinization after ε-removal, then the integration
of ε-removal within determinization may often be more efficient. This is because
to compute the ε-closure of a subset created by determinization, one can run a
single shortest-distance algorithm from the states of that subset rather than a
distinct one for each state of the subset. Since some states can be reached by
several elements of the subset, the first method provides more sharing. This is
also corroborated by experiments carried out on various automata in the context
of natural language processing applications.

This integration of determinization and ε-removal can be extended to
the weighted case if one replaces the classical unweighted determinization by
weighted determinization and if one uses the weighted ε-closure presented
in the previous sections. It is however limited to the cases where the weighted
automaton resulting from ε-removal is determinizable since in the weighted case
this is not always guaranteed.
Notice, however, that in the integrated algorithm presented in textbooks, the
computation of the ε-closure is limited to the use of a FIFO queue discipline.
Thus, it is a special case of our general algorithm which may be less efficient
both in terms of complexity and in practice than using a shortest-first queue
discipline as in Dijkstra's algorithm, or a topological order in the case of acyclic
automata. Our experiments confirm that in the case of acyclic automata, the
ε-removal based on a topological order queue discipline can be many times faster
than the one based on a shortest-first queue discipline, which itself is faster than
the one based on a FIFO queue discipline.

The algorithm we presented can be straightforwardly modified to remove
transitions with a label $a$ different from ε. This can be done for example by
replacing ε by a new label and $a$ by ε, applying ε-removal and then restoring
the original ε's.

The shortest-distance algorithm presented in the previous sections admits
an approximation version where the equality of line 11 in the pseudocode of
figure 3 is replaced by an approximate equality modulo some predefined
constant $\delta$ [7]. This can be used to remove ε-transitions of a weighted automaton
$A$ over a semiring $\mathbb{K}$ such as $(\mathbb{R}, +, *, 0, 1)$, even when $\mathbb{K}$ is not k-closed for $A_\varepsilon$.
Although the transition weights are then only approximations of the correct
results, this may be satisfactory for many practical purposes such as speech
processing applications. Furthermore, one can arbitrarily improve the quality of the
approximation by reducing $\delta$. As mentioned before, one can also use a
generalization of the Floyd-Warshall algorithm to compute the result in an exact way
in the case of machines $A_\varepsilon$ with relatively small strongly connected components
and when the weight of each cycle of $A_\varepsilon$ admits a well-defined closure.
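The approximation can be illustrated on the real semiring with a single cycle of weight 0.5, for which the exact distance is a geometric series. The sketch below is an assumption-laden illustration rather than the library's code: the function name, the FIFO queue, the default value of `delta` and the example graph are all chosen for the example.

```python
from collections import deque


def approx_shortest_distance(states, E, s, delta=1e-9):
    """Shortest-distance sketch over (R, +, *, 0, 1) with approximate equality.

    E maps q -> list of (weight, dest). An update of magnitude at most delta
    is treated as "no change", which forces termination on convergent cycles.
    """
    d = {p: 0.0 for p in states}
    r = {p: 0.0 for p in states}
    d[s] = r[s] = 1.0
    queue, queued = deque([s]), {s}
    while queue:
        q = queue.popleft()
        queued.discard(q)
        rq, r[q] = r[q], 0.0
        for (w, n) in E.get(q, []):
            candidate = d[n] + rq * w
            if abs(candidate - d[n]) > delta:   # approximate form of the line 11 test
                d[n] = candidate
                r[n] += rq * w
                if n not in queued:
                    queue.append(n)
                    queued.add(n)
    d[s] = 1.0
    return d


# Hypothetical graph: a self-loop of weight 0.5 on state 0, then an arc to state 1.
E = {0: [(0.5, 0), (1.0, 1)]}
d = approx_shortest_distance([0, 1], E, 0)
print(d[1])   # close to 1 + 0.5 + 0.25 + ... = 2
```

Reducing `delta` lets more terms of the series through before the loop stops, at the cost of more iterations.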

5 Conclusion 

A generic algorithm for ε-removal of weighted automata was given. The algorithm
works with any semiring covered by our framework. In particular, it can work
with semirings that are not necessarily idempotent. It works with any weighted
automaton or transducer over the tropical semiring or the boolean semiring. It
was shown to be more efficient than the existing algorithm both in space and
time complexity. Experiments confirm this improvement of efficiency in practice.
The algorithm admits a natural on-the-fly implementation which can be
combined with other on-the-fly algorithms such as determinization or composition
of weighted automata to optimize weighted automata.
Abstract. Cover automata were introduced in earlier work as an efficient
representation of finite languages. An algorithm was previously given to transform
a DFA that accepts a finite language to a minimal deterministic finite
cover automaton (DFCA) with the time complexity $O(n^4)$, where n is
the number of states of the given DFA. In this paper, we introduce a
new efficient transformation algorithm with the time complexity $O(n^2)$,
which is a significant improvement over the previous algorithm.



1 Introduction 

Finite languages have many practical applications. However, the finite
languages used in applications are generally very large and need thousands or
even millions of states if represented by deterministic finite automata (DFA) or
similar structures. Deterministic finite cover automata (DFCA) were
introduced as an alternative representation of finite languages. Experiments have
shown that, in many cases, DFCA are much smaller in size than their
corresponding minimal DFA.

Let L be a finite language and l the length of the longest word(s) in L.
Intuitively, a DFCA A for L is a DFA that accepts all words in L and possibly
additional words of length greater than l. So, a word w is in L if and only if it is
accepted by A (as a DFA) and it has a length less than or equal to l. Note that
checking the length of a word is usually not an extra burden in practice since
the length of an input word is kept anyway in most applications.

In order to explain intuitively the notion of a DFCA, we give a very simple
example in the following. Let $\Sigma = \{a, b, c\}$ be the alphabet and
$L = \{abc, ababc, abababc\}$ a finite language over $\Sigma$. Clearly, the length of the longest
word in L is 7, i.e., l = 7. The minimal DFA accepting L is shown in Figure 1,
which has 8 states (9 if complete). A minimal DFCA is shown in Figure 2, which
has only 4 states (5 if complete).
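The example can be checked mechanically. The sketch below builds one possible 4-state partial DFA accepting ab(ab)*c; since the figure is not reproduced here, this automaton is an assumption and may differ in layout from the one in Figure 2. Membership in L is then decided by acceptance plus the length test, and the exhaustive loop confirms that the automaton agrees with L on every word of length at most l.

```python
from itertools import product

L = {"abc", "ababc", "abababc"}
l = max(len(w) for w in L)   # 7

# Hypothetical 4-state partial DFA accepting ab(ab)*c; state 3 is final.
delta = {(0, "a"): 1, (1, "b"): 2, (2, "a"): 1, (2, "c"): 3}
final = {3}


def dfa_accepts(word: str) -> bool:
    state = 0
    for ch in word:
        state = delta.get((state, ch))
        if state is None:        # missing transition: implicit sink state
            return False
    return state in final


def in_language(word: str) -> bool:
    """Membership in L through the cover automaton: accept, then check the length."""
    return len(word) <= l and dfa_accepts(word)


# Exhaustive check over all words of length at most l on {a, b, c}.
agrees = all(in_language("".join(t)) == ("".join(t) in L)
             for n in range(l + 1) for t in product("abc", repeat=n))
print(agrees)                                               # True
print(dfa_accepts("ababababc"), in_language("ababababc"))   # True False
```

The last line shows the extra word of length 9 that the automaton accepts as a DFA but that the length test rejects.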

An algorithm was previously given for constructing a minimal DFCA from a
given DFA that accepts a finite language. The time complexity of that algorithm
is $O(n^4)$, where n is the number of states of the DFA. Note that the number of
transitions of a DFA is linear in its number of states. In this paper, we give an
$O(n^2)$ algorithm for the construction of a minimal DFCA from a given DFA. The
new algorithm is not only a significant improvement over the previous algorithm
in time complexity, but also much easier to comprehend and to implement.

Fig. 1. The minimal DFA accepting L

Fig. 2. A minimal DFCA for L with l = 7

Research Council of Canada grants OGP0041630 and a graduate scholarship.

The two algorithms differ mainly in how they compute the similarity (or
dissimilarity) relation between states. The new algorithm computes the pairs of
states that are dissimilar and propagates the dissimilarity relations, rather than
computing the similarity relation directly as in the previous algorithm. A new
algorithm is also given for merging similar states, which is simpler than the one
given previously. We also prove several new theorems on the similarity relation which
form the theoretical basis of the new algorithm.

In the next section, we give the basic definitions and notation, as well as 
the basic results, on cover languages and automata. In Section 3, we prove two 
theorems which are essential to the new algorithm. In Section 4, we describe our 
new algorithm and analyze its complexity. In the last section, we conclude the 
paper. 

2 Preliminaries 

First, we give the basic definitions and notation for cover languages, cover
automata, and the similarity relation. Then we list some basic results, which are
relevant to this paper, without giving any proofs. Detailed explanations and
proofs can be found in the earlier papers on cover automata.

Let $S$ be a finite set and n a nonnegative integer. By $S^{\leq n}$ we denote $\bigcup_{i=0}^{n} S^i$.

Definition 1. Let L C S* be a finite language over an alphabet S and I the 
length of the longest word(s) in L. A language L' over S is ealled a eover lan- 
guage ofL ifL'nS^^ = L. 
Definition 2. A cover automaton for a finite language $L$ is a finite automaton
$A$ such that the language accepted by $A$, i.e., $L(A)$, is a cover language of $L$. If
$A$ is a DFA, then $A$ is called a deterministic finite cover automaton (DFCA) for
$L$.



We often use the term cover automaton casually to mean DFCA in this paper.
In the following, we give the basic definitions regarding the similarity relation.
We first define the similarity relation on words with respect to a finite language,
and then the similarity relation on states of a DFA that accepts a finite language.
The notion of similarity between words was first introduced in [?], and then
studied in [?], [?], etc. The concept of the similarity relation on words is the
basis for the similarity relation on states of a DFA.

Definition 3. Let $L$ be a finite language over the alphabet $\Sigma$ and $l$ the length
of the longest word(s) in $L$. Let $x, y \in \Sigma^*$. We define the following relations:

(1) $x \sim_L y$ if for all $z \in \Sigma^*$ such that $|xz| \le l$ and $|yz| \le l$, $xz \in L$ iff $yz \in L$;

(2) $x \not\sim_L y$ if $x \sim_L y$ does not hold.

The relation $\sim_L$ is called the similarity relation with respect to $L$. We will use
$x \sim y$ instead of $x \sim_L y$ when $L$ is clearly understood from the context. Note
that the relation $\sim_L$ is reflexive and symmetric, but NOT transitive.
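
Definition 3 can be checked by brute force on small examples. The following Python sketch is illustrative only: the function name `similar` and the one-word sample language are assumptions, not part of the paper. It enumerates every admissible word z and exhibits three words showing that the relation is not transitive.

```python
from itertools import product

def similar(x, y, language, alphabet):
    """Return True if x ~_L y in the sense of Definition 3 (brute force)."""
    l = max(len(w) for w in language)      # length of the longest word in L
    bound = l - max(len(x), len(y))        # longest z with |xz| <= l and |yz| <= l
    for k in range(bound + 1):             # no iterations if bound < 0 (vacuously similar)
        for letters in product(alphabet, repeat=k):
            z = "".join(letters)
            if ((x + z) in language) != ((y + z) in language):
                return False
    return True

# Hypothetical example: L = {a}, so l = 1.
L = {"a"}
print(similar("", "aa", L, "a"))   # True: no admissible z exists
print(similar("aa", "a", L, "a"))  # True: no admissible z exists
print(similar("", "a", L, "a"))    # False: the empty z separates them
```

Here the empty word is similar to aa and aa is similar to a, yet the empty word and a are dissimilar, because the long word aa is vacuously similar to everything.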

Lemma 1. Let $L \subseteq \Sigma^*$ be a finite language and $x, y, z \in \Sigma^*$, $|x| \le |y| \le |z|$.
The following statements hold:

1. If $x \sim_L y$, $x \sim_L z$, then $y \sim_L z$.

2. If $x \sim_L y$, $y \sim_L z$, then $x \sim_L z$.

3. If $x \sim_L y$, $y \not\sim_L z$, then $x \not\sim_L z$.

Definition 4. Let $L \subseteq \Sigma^*$ be a finite language.

1. A sequence of words $(x_1, \ldots, x_n)$ over $\Sigma$ is called a dissimilar sequence of $L$
if $x_i \not\sim_L x_j$ for each pair $i, j$, $1 \le i, j \le n$ and $i \ne j$.

2. A dissimilar sequence $(x_1, \ldots, x_n)$ of $L$ is called a maximal dissimilar
sequence of $L$ if for any dissimilar sequence $(y_1, \ldots, y_m)$ of $L$, $m \le n$.

In the following, we define the similarity relation on the set of states of a 
DFA or a DFCA. Note that if a DFA A accepts a finite language L, then A is 
also a DFCA for L. 

Definition 5. Let $A = (Q, \Sigma, \delta, s, F)$ be a DFA (or a DFCA). We define, for
each state $q \in Q$,

$level(q) = \min\{|w| : \delta(s, w) = q\}$,

i.e., $level(q)$ is the length of the shortest path (in the directed graph associated
with the automaton) from the initial state to $q$.
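
Since level(q) is a shortest-path length, it can be computed by a breadth-first search from the initial state. The sketch below is a minimal illustration; the dictionary encoding of the transition function and the sample automaton are assumptions made for the example.

```python
from collections import deque

def levels(delta, start):
    """Return level(q) for every state reachable from start, via BFS."""
    level = {start: 0}
    queue = deque([start])
    while queue:
        q = queue.popleft()
        for r in delta[q].values():
            if r not in level:           # first visit gives the shortest distance
                level[r] = level[q] + 1
                queue.append(r)
    return level

# Hypothetical complete DFA over {a, b}; state 4 plays the role of the sink.
delta = {
    0: {"a": 1, "b": 2},
    1: {"a": 3, "b": 3},
    2: {"a": 3, "b": 4},
    3: {"a": 4, "b": 4},
    4: {"a": 4, "b": 4},
}
print(levels(delta, 0))
```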
Definition 6. Let $A = (Q, \Sigma, \delta, s, F)$ be a DFCA for a finite language $L$ with
$l$ being the length of the longest word(s) in $L$. Let $p, q \in Q$ and
$m = \max\{level(p), level(q)\}$. We say that $p \sim_A q$ if for every $w \in \Sigma^{\le l-m}$,
$\delta(p, w) \in F$ iff $\delta(q, w) \in F$.

We use the notation $p \sim q$ instead of $p \sim_A q$ whenever $L$ is clearly understood
from the context.

We are now ready to state the theorem that is the basis for any algorithm 
for the minimization of DFCAs. 

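Theorem 1. Let $A = (Q, \Sigma, \delta, s, F)$ be a DFCA for a finite language $L$. Assume
that $p \sim_A q$ for some $p, q \in Q$ such that $p \ne q$ and $level(p) \le level(q)$. Then
we can construct a DFCA $A' = (Q', \Sigma, \delta', s, F')$ for $L$ such that $Q' = Q - \{q\}$,
$F' = F - \{q\}$, and

$$\delta'(t, a) = \begin{cases} \delta(t, a) & \text{if } \delta(t, a) \ne q, \\ p & \text{otherwise} \end{cases}$$

for each $t \in Q'$ and $a \in \Sigma$.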
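
The construction in Theorem 1 deletes q and redirects every transition that entered q to p. A minimal Python sketch of this single merge step follows; the function name `merge_state`, the dictionary encoding, and the sample automaton (a DFA for the hypothetical language {aa}, whose sink state 3 is similar to state 0) are assumptions for illustration.

```python
def merge_state(delta, finals, p, q):
    """Build (delta', F') as in Theorem 1: drop q, redirect its incoming transitions to p."""
    new_delta = {
        t: {a: (p if r == q else r) for a, r in row.items()}
        for t, row in delta.items()
        if t != q
    }
    return new_delta, set(finals) - {q}

# DFA for {aa}: 0 -a-> 1 -a-> 2 (final) -a-> 3 (sink).
delta = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3}}
new_delta, new_finals = merge_state(delta, {2}, p=0, q=3)
print(new_delta, new_finals)
```

The result is a three-state cycle that accepts aa, aaaaa, and so on; restricted to words of length at most 2 it accepts exactly {aa}.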

Definition 7. A DFCA A for a finite language is a minimal DFCA if and only 
if no two different states of A are similar. 

Theorem 2. For a finite language L, there is a unique number N(L) such that 
any minimal DFCA for L has exactly N(L) states. 

Please refer to [?] for the proofs of the above theorems.

Definition 8. Let $A = (Q, \Sigma, \delta, s, F)$ be a DFA or a DFCA for a finite language
$L$, with $l$ the length of the longest word(s) in $L$. For $p \in Q$, denote by $x_p$ a
shortest word in $\Sigma^*$ such that $\delta(s, x_p) = p$; $x_p$ is called a “representative” of $p$.

Note that for each $q \in Q$, $|x_q| = level(q)$.

Theorem 3. $p \sim_A q$ if and only if $x_p \sim_L x_q$.

One may refer to [?] for a proof.

3 The New Algorithm 

Given a DFA that accepts a finite language, we can construct a minimal DFCA
for the given language in two steps: (1) compute the similarity relation between
the states of the DFA, and (2) merge similar states. Note that the similarity
relation is not transitive. So, if $p \sim q$ and $q \sim r$, we cannot simply merge
$p$, $q$, and $r$ together in general. Step (1) is the most complex one. A naive
algorithm for determining whether $p \sim q$ is to check whether $\delta(p, z), \delta(q, z) \in F$ or
$\delta(p, z), \delta(q, z) \in Q - F$ for all words $z$ such that $|z| \le l - \max\{level(p), level(q)\}$.
This would need exponential time. The algorithm given in [?] needs $O(n^2)$
time to determine whether two states are similar. The time complexity of the
entire step (1) of that algorithm is $O(n^4)$.

Here we use a different approach, and the time complexity of our algorithm is
$O(n^2)$. In this section, we will describe our new algorithm. However, before
describing the algorithm, we have to give several new definitions and prove several
new results.



3.1 New Definitions and Results 

Again we assume that $A = (Q, \Sigma, \delta, s, F)$ is a DFA accepting a finite language $L$
over the alphabet $\Sigma$ and $l$ is the length of the longest word(s) in $L$. We assume
that $A$ is a complete DFA and there is no useless state in $Q$ except the sink state
$d$, i.e., for each $q \in Q - \{d\}$, there exist $u, v \in \Sigma^*$ such that $\delta(s, u) = q$ and
$\delta(q, v) \in F$.

Definition 9. For $p, q \in Q$ and $p \ne q$, we define

$range(p, q) = l - \max\{level(p), level(q)\}$.

Intuitively, $range(p, q)$ is the maximum length of a word $w$ that satisfies both
$|x_p w| \le l$ and $|x_q w| \le l$.

Definition 10. Let $p, q \in Q$ and $z \in \Sigma^*$. We say that $p$ and $q$ fail on $z$ if
$\delta(p, z) \in F$ and $\delta(q, z) \in Q - F$ or vice versa, and $|z| \le range(p, q)$.

Theorem 4. $p \not\sim q$ if and only if there exists $z \in \Sigma^*$ such that $p$ and $q$ fail on $z$.

Definition 11. If $p \not\sim q$, we define

$gap(p, q) = \min\{|z| : p \text{ and } q \text{ fail on } z\}$.

If $p \not\sim q$, then $gap(p, q)$, intuitively, is the length of the shortest word(s) that
can show that $p$ and $q$ are dissimilar. It is clear that $gap(p, q) = gap(q, p)$ and
$gap(p, q) < l$ for any $p, q \in Q$ such that $p \not\sim q$. For convenience, we define
$gap(p, q) = l$ if $p \sim q$. The next theorem is clear.

Theorem 5.

(1) Let $d$ be the sink state of $A$. If $level(d) > l$, then $d \sim q$ for each $q \in Q - \{d\}$.
If $level(d) \le l$, then $d \not\sim f$ and $gap(d, f) = 0$ for each $f \in F$.

(2) If $p \in F$ and $q \in Q - F - \{d\}$ or vice versa, then $p \not\sim q$ and $gap(p, q) = 0$.
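
The notions of range, failing on a word, and gap can be evaluated directly from their definitions, at exponential cost. The Python sketch below does so for reference; the function name `brute_gap`, the dictionary encoding, and the sample DFA (accepting the hypothetical language {aa}, with l = 2) are assumptions for illustration. It returns l when no failing word exists, following the convention gap(p, q) = l for similar states.

```python
from itertools import product

def brute_gap(delta, finals, level, l, p, q):
    """Smallest |z| such that p and q fail on z, or l if p and q are similar."""
    alphabet = list(delta[p])
    rng = l - max(level[p], level[q])            # range(p, q)
    for k in range(rng + 1):                     # candidate lengths in increasing order
        for z in product(alphabet, repeat=k):
            s, t = p, q
            for a in z:
                s, t = delta[s][a], delta[t][a]
            if (s in finals) != (t in finals):   # p and q fail on z
                return k
    return l

# DFA for {aa}: 0 -a-> 1 -a-> 2 (final) -a-> 3 (sink), l = 2.
delta = {0: {"a": 1}, 1: {"a": 2}, 2: {"a": 3}, 3: {"a": 3}}
finals = {2}
level = {0: 0, 1: 1, 2: 2, 3: 3}
print(brute_gap(delta, finals, level, 2, 0, 1))  # 1: the word a separates 0 and 1
print(brute_gap(delta, finals, level, 2, 0, 2))  # 0: exactly one of them is final
print(brute_gap(delta, finals, level, 2, 2, 3))  # 2: range is negative, so 2 ~ 3
```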



Lemma 2. Let $p, q \in Q$, $p \ne q$, and $r = \delta(p, a)$ and $t = \delta(q, a)$, for some $a \in \Sigma$.
Then $range(p, q) \le range(r, t) + 1$.

Proof. It is clear that $level(r) \le level(p) + 1$ and $level(t) \le level(q) + 1$. So,
$\max\{level(r), level(t)\} \le \max\{level(p), level(q)\} + 1$.

Then, by Definition 9, $range(p, q) \le range(r, t) + 1$. □



Theorem 6. Let $p$ and $q$ be two states such that either $p, q \in F$ or $p, q \in Q - F$.
Then $p \not\sim q$ if and only if there exists $a \in \Sigma$ such that $\delta(p, a) = r$ and $\delta(q, a) = t$,
$r \not\sim t$, and

$gap(r, t) + 1 \le range(p, q)$.

Proof. Only if: We assume that $p \not\sim q$ and will show that there exists a pair $(r, t)$
satisfying the conditions of the theorem. Choose $z \in \Sigma^*$ such that $p$ and $q$ fail on
$z$ and $|z| = gap(p, q)$. Note that $|z| > 0$ because of the given condition on $p$ and $q$.
By the definition of the gap function, we know that $|z| \le range(p, q)$. Without loss
of generality, we assume that $\delta(p, z) \in F$ and $\delta(q, z) \in Q - F$. Let $z = az'$. Then
$\delta(p, az') = \delta(r, z') \in F$ and $\delta(q, az') = \delta(t, z') \in Q - F$ for some $r, t \in Q$. By
Lemma 2 we know that $range(r, t) \ge range(p, q) - 1$. Then $|z'| \le range(r, t)$.
By Definition 10, $r$ and $t$ fail on $z'$, and hence $r \not\sim t$. Since $gap(r, t) \le |z'|$, we have
$gap(r, t) + 1 \le range(p, q)$.

If: Assume that there exists $a \in \Sigma$ such that $\delta(p, a) = r$, $\delta(q, a) = t$, $r \not\sim t$, and
$gap(r, t) + 1 \le range(p, q)$. Then there is $z' \in \Sigma^*$ such that $r$ and $t$ fail on $z'$ and
$|z'| = gap(r, t)$. Let $z = az'$. Then $|z| = gap(r, t) + 1$ and thus $|z| \le range(p, q)$.
Therefore, $p$ and $q$ fail on $z$. In other words, $p \not\sim q$. □

The following theorem gives a formula which computes $gap(p, q)$ for two states
$p$ and $q$ that are either both final states or both non-final states.

Theorem 7. If $p \not\sim q$ such that $p, q \in F$ or $p, q \in Q - F$, then

$gap(p, q) = \min\{gap(r, t) + 1 \mid \delta(p, a) = r \text{ and } \delta(q, a) = t, \text{ for } a \in \Sigma,$
$r \not\sim t, \text{ and } gap(r, t) + 1 \le range(p, q)\}$.

Proof. We first prove that $gap(p, q) \le gap(r, t) + 1$ for every pair $r, t \in Q$ such
that $\delta(p, a) = r$, $\delta(q, a) = t$, $r \not\sim t$, and $gap(r, t) + 1 \le range(p, q)$. Let $(r, t)$ be an
arbitrary pair that satisfies the above conditions. Since $r \not\sim t$, there exists $z'$ such
that $r$ and $t$ fail on $z'$ and $|z'| = gap(r, t)$. It is also clear that $|az'| \le range(p, q)$
since $gap(r, t) + 1 \le range(p, q)$. Then $p$ and $q$ fail on $z = az'$. By definition,
$gap(p, q) \le |z|$. So, we have $gap(p, q) \le gap(r, t) + 1$.

We now prove the other direction, i.e., there exist $r, t \in Q$ such that $\delta(p, a) = r$,
$\delta(q, a) = t$, $r \not\sim t$, and $gap(r, t) + 1 \le gap(p, q)$. Let $z \in \Sigma^*$ be such that $p$ and $q$ fail
on $z$ and $|z| = gap(p, q)$. Clearly, $|z| > 0$ by the given conditions. Then $z = az'$ for
some $a \in \Sigma$. Let $\delta(p, a) = r$ and $\delta(q, a) = t$. Then $range(p, q) \le range(r, t) + 1$ by
Lemma 2. Thus, $|z'| \le range(r, t)$. Then clearly $r$ and $t$ fail on $z'$. By definition,
$gap(r, t) \le |z'|$. Then we have $gap(r, t) \le |z| - 1 = gap(p, q) - 1$. □
3.2 The Algorithm

The algorithm consists of two main parts: the first is to determine the similarity 
relation between states; the second is to merge similar states. 

In the first part of the algorithm, we determine the similarity relation by 
computing the gap function starting from the sink state and along the inverse 
direction of the transitions of the given DFA. Note that the construction of 
a minimal DFCA is different from the minimization of an acyclic DFA. In the
latter case, if two states have different heights then they are not equivalent. 
(The height of a state is the length of the longest path starting from this state 
to a final state.) However, in the former, it is possible that two states are similar 
even if they have different heights. The equivalence relation is a refinement of 
the similarity relation with respect to a finite language. 

In the following, we assume that the given DFA accepting a finite language is 
complete (a transition is defined for each state and each letter in the alphabet) 
and reduced (no useless states except one sink state). We also assume that the 
given DFA is ordered, i.e., the $n + 1$ states (including the sink state) of the DFA
are numbered by $0, 1, \ldots, n$ such that there is no transition from state $j$ to state
$i$ if $0 \le i \le j < n$. This implies that 0 is the starting state, $n$ is the sink state,
and $n - 1$ is the last final state. All the above pre-conditions can be achieved in
linear time in terms of the number of states of the given DFA. Note that the size
of a DFA is linear in its number of states.

Algorithm for computing the gap function

Input: An ordered, reduced, and complete DFA A = (Q, Σ, δ, 0, F), with n + 1
states, which accepts a finite language L, and the length l of the longest word
in L

Output: gap(i, j) for each pair i, j ∈ Q and i < j

Algorithm:

1. for each i ∈ Q compute level(i) end for;

2. for i = 0 to n - 1 do gap(i, n) = l end for;
   if level(n) ≤ l then
      for each i ∈ F gap(i, n) = 0 end for
   end if;

3. for each pair i, j ∈ Q - {n} such that i < j
      if i ∈ F and j ∈ Q - F or vice versa then
         gap(i, j) = 0;
      else
         gap(i, j) = l;
      end if;
   end for;

4. for i = n - 2 down to 0 do
      for j = n down to i + 1 do
         for each a ∈ Σ do
            let i' = δ(i, a) and j' = δ(j, a);
            if i' ≠ j' then
               g = if (i' < j') then gap(i', j') else gap(j', i');
               if g + 1 ≤ range(i, j) then
                  gap(i, j) = min(gap(i, j), g + 1);
               end if;
            end if;
         end for;
      end for;
   end for
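
The four steps above translate almost line by line into Python. The sketch below is one possible rendering, not the authors' implementation: the function name `compute_gap`, the list-of-dictionaries encoding of the transition function, and the sample DFA (accepting the hypothetical language {aa}, with l = 2) are assumptions made for illustration.

```python
from collections import deque

def compute_gap(delta, finals, l):
    """delta[i][a] is the successor of state i on letter a; states are 0..n, n is the sink."""
    n = len(delta) - 1
    alphabet = list(delta[0])

    # Step 1: levels by breadth-first search from state 0.
    level = [None] * (n + 1)
    level[0] = 0
    queue = deque([0])
    while queue:
        q = queue.popleft()
        for a in alphabet:
            r = delta[q][a]
            if level[r] is None:
                level[r] = level[q] + 1
                queue.append(r)

    def rng(i, j):                       # range(i, j)
        return l - max(level[i], level[j])

    gap = {}
    # Step 2: pairs involving the sink state n.
    for i in range(n):
        gap[i, n] = 0 if (i in finals and level[n] <= l) else l
    # Step 3: remaining pairs; l means "not yet known to be dissimilar".
    for i in range(n):
        for j in range(i + 1, n):
            gap[i, j] = 0 if ((i in finals) != (j in finals)) else l
    # Step 4: propagate dissimilarity backwards along the transitions.
    for i in range(n - 2, -1, -1):
        for j in range(n, i, -1):
            for a in alphabet:
                i2, j2 = delta[i][a], delta[j][a]
                if i2 != j2:
                    g = gap[min(i2, j2), max(i2, j2)]
                    if g + 1 <= rng(i, j):
                        gap[i, j] = min(gap[i, j], g + 1)
    return gap

# DFA for {aa}: 0 -a-> 1 -a-> 2 (final) -a-> 3 (sink).
delta = [{"a": 1}, {"a": 2}, {"a": 3}, {"a": 3}]
gap = compute_gap(delta, finals={2}, l=2)
print(gap)
```

In this example only the sink is similar to other states (its gap to every state equals l = 2), while states 0 and 1 are separated by the word a.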

Algorithm for merging similar states

Input: An ordered, reduced, and complete DFA A = (Q, Σ, δ, 0, F) which accepts
a finite language L, and gap(i, j) for each pair i, j ∈ Q and i < j
Output: A minimal DFCA A' for L

Algorithm:

1. Let P[0..n] be a Boolean array with each P[i], 0 ≤ i ≤ n, initialized to false;

2. for i = 0 to n - 1 do
      if P[i] == false then
         for j = i + 1 to n do
            if P[j] == false and gap(i, j) = l then
               merge j to i;
               P[j] = true;
            end if;
         end for;
      end if;
   end for.

For convenience, we assume that the number of states is n + 1 in the above 
algorithm and there is at least one state in A. Thus n = 0 if there is only one 
state in A. 

The step “merge j to i;” follows the steps described in Theorem 1.
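
The merging pass can be sketched in Python as follows. This is an illustrative rendering under stated assumptions: the name `merge_similar`, the `target` bookkeeping array (used to redirect transitions in one pass rather than after every merge), and the hard-coded gap table (the values of the gap function for a DFA accepting the hypothetical language {aa}, with l = 2) are not taken from the paper.

```python
def merge_similar(delta, finals, gap, l):
    """Merge every state j into the smallest unmerged state i < j with gap(i, j) == l."""
    n = len(delta) - 1
    merged = [False] * (n + 1)           # the Boolean array P[0..n]
    target = list(range(n + 1))          # target[j]: state that j was merged into
    for i in range(n):
        if not merged[i]:
            for j in range(i + 1, n + 1):
                if not merged[j] and gap[i, j] == l:   # i and j are similar
                    target[j] = i                      # "merge j to i"
                    merged[j] = True
    new_delta = {
        q: {a: target[r] for a, r in delta[q].items()}
        for q in range(n + 1) if not merged[q]
    }
    new_finals = {q for q in finals if not merged[q]}
    return new_delta, new_finals

# DFA for {aa}: 0 -a-> 1 -a-> 2 (final) -a-> 3 (sink), with its gap table for l = 2.
delta = [{"a": 1}, {"a": 2}, {"a": 3}, {"a": 3}]
gap = {(0, 1): 1, (0, 2): 0, (1, 2): 0, (0, 3): 2, (1, 3): 2, (2, 3): 2}
dfca_delta, dfca_finals = merge_similar(delta, {2}, gap, l=2)
print(dfca_delta, dfca_finals)

def accepts(word):
    """Run the resulting cover automaton on a word."""
    q = 0
    for a in word:
        q = dfca_delta[q][a]
    return q in dfca_finals
```

The sink is merged into state 0, leaving a three-state cover automaton; among the words of length at most 2 it accepts only aa.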

The correctness of the algorithm can be easily established with Theorem [?],
Theorem [?], and Theorem [?]. So, we omit the formal proof here.

We now consider the time complexity of the algorithm. In the first part, each of
Step 1 and Step 2 is $O(n)$. Clearly, Step 3 takes $O(n^2)$ iterations. Step 4 is the
main part, which has two nested loops, each of which has $O(n)$ iterations. Each
inner iteration is $O(|\Sigma|)$, where $|\Sigma|$ is a constant. Therefore, the first part of the
algorithm, which computes the gap function, is $O(n^2)$. Clearly, the second part is
also $O(n^2)$. So, the time complexity of the algorithm is $O(n^2)$.

4 Concluding Remarks 

We have shown an $O(n^2)$ algorithm for constructing a minimal DFCA for a
finite language given in the form of a DFA. This is a significant improvement
over the $O(n^4)$ algorithm given in [?]. This new algorithm is also much easier to
comprehend and implement. The algorithm can be modified into a minimization
algorithm for general DFCA.

In the future, we will conduct more experiments on DFCA with finite lan- 
guages from real-world applications. It is important to know how much reduction 
on the size of the automata one can achieve by using DFCA instead of DFA. 
We believe that the reduction can be large for certain types of applications, but 
minor on others. 
Abstract. In this paper we study the costs, in terms of states, of some
basic operations on regular languages, in the unary case, namely in the 
case of languages defined over a one letter alphabet. In particular, we 
concentrate our attention on the concatenation. The costs, which are 
proved to be tight, are given by explicitly indicating the number of states 
in the noncyclic and in the cyclic parts of the resulting automata. 



1 Introduction 

Finite automata are one of the first computational models presented in the lit- 
erature and, certainly, one of the most extensively investigated. However, some 
problems concerning these simple models are still open and the investigation of 
some aspects of the finite automata world is only at the beginning. 

For instance, many complexity results for finite automata are given under
the hypothesis that the input alphabet contains at least two symbols. As an
example, consider the simulation of an $n$-state nondeterministic automaton by
an equivalent $2^n$-state deterministic one: it is optimal when the input alphabet
contains at least two symbols [?], but, as shown in 1986 by Chrobak [2], its cost
can be reduced in the unary case, namely for automata with a one letter input
alphabet.

Very recently, the investigation of the unary case was reconsidered by pointing
out many differences between the world of unary automata and the universe of
all finite automata [?]. More generally, differences between the unary and the
general case have been shown not only for finite automata and regular languages,
but even for other classes of machines and languages (see, e.g., [?]).

In this paper, we study the state complexity of some operations on unary 
languages. We recall that state complexity is a type of descriptional complexity 
for regular languages, based on deterministic finite automata. More precisely, the 
state complexity of a regular language is the number of states of the minimum 
automaton accepting it. Furthermore, as pointed out by Shallit and Breitbart 
metodi sintattici e combinatori”.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 252-gS5| 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 



Unary Language Concatenation and Its State Complexity 253 



[?], state complexity can be extended in order to estimate even the costs of the
descriptions of nonregular languages. 

In order to evaluate the state complexity of an operation such as, for instance,
the concatenation, we have to express the number of states of the minimum
deterministic automaton accepting the concatenation of two languages $L'$ and
$L''$, as a function of the numbers of states of the minimal deterministic automata
accepting $L'$ and $L''$.

Tight evaluations of the state complexity of the concatenation and of the star
of regular languages were obtained by Yu, Zhuang, and Salomaa [?]. For both
operations, the state complexity of the resulting language can be exponential.
On the other hand, as pointed out in [?], the state complexity of the union
and of the intersection of two regular languages is the product of their state
complexities. However, as shown in the same papers, if $L'$ and $L''$ are unary
languages accepted by two unary deterministic automata with $m$ and $n$ states,
respectively, such that the numbers $m$ and $n$ are relatively prime, then the
worst case state complexity of $L'L''$, $L' \cap L''$, and $L' \cup L''$ is $mn$, while the
state complexity of $L'^*$ is $(m - 1)^2 + 1$. Recently, the same operations on unary
regular languages were considered by C. Nicaud [?], obtaining estimations of
their average state complexities.

Since the transition graph of a unary deterministic automaton $A$ consists of
an initial path of $\mu \ge 0$ states which is followed by a cycle of $\lambda \ge 1$ states (the
pair $(\lambda, \mu)$ will be called the size of $A$), to study the state complexity in the unary
case it seems quite natural to take into account not only the total number of
states, but also how many of them are in the cyclic part and how many of them
are in the initial path.

Recently, J. Shallit considered the union and the intersection of unary regular
languages: he proved that if $L'$ and $L''$ are accepted by two automata $A'$ and
$A''$ of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively, then both $L' \cup L''$ and $L' \cap L''$ are
accepted by automata of size $(\lambda, \mu)$, where $\lambda$ is the least common multiple of
$\lambda'$ and $\lambda''$, and $\mu$ is the maximum of $\mu'$ and $\mu''$. Furthermore, this upper
bound is tight [?]. In this paper, we consider the concatenation of $L'$ and $L''$. In
Section 3 we prove that the language $L'L''$ is accepted by an automaton of size
$(\lambda, \mu)$, where $\lambda$ is, as for the union and intersection, the least common multiple
of $\lambda'$ and $\lambda''$, while $\mu = \mu' + \mu'' + \lambda - 1$. We also consider some particular
situations, e.g., all the states on the initial path of $A'$ or of $A''$ are nonfinal,
or $\lambda'$ and $\lambda''$ are relatively prime. We prove that in these cases the size of the
resulting automaton can be further reduced. For instance, when $\lambda'$ and $\lambda''$ are
relatively prime and both $L'$, $L''$ are infinite, the length of the cycle in the
minimum automaton accepting $L'L''$ reduces to one. With few exceptions, these
estimations are shown to be tight, for all possible sizes of the given automata.
In Table [?] and Table [?] at the end of Section 3, all these results concerning the
state complexity of the concatenation are summarized.

We conclude the paper by presenting, in Section [?], some considerations
concerning the star operation.
For brevity reasons, some of the proofs are omitted or just outlined in this
version of the paper.

2 Preliminary Notions and Results 

In this section, we recall basic notions, notations and facts used in the paper. 

Given two integers $a, b > 0$, we denote by $\gcd(a, b)$ and by $\mathrm{lcm}(a, b)$ their
greatest common divisor and their least common multiple, respectively. The
following result will be crucial in order to evaluate the number of states of unary
automata:

Lemma 1. Given two integers $a, b > 0$, each number of the form $ax + by$, with
$x, y \ge 0$, is a multiple of $\gcd(a, b)$. Furthermore, the largest multiple of $\gcd(a, b)$
that cannot be represented as $ax + by$, with $x, y \ge 0$, is $\mathrm{lcm}(a, b) - (a + b)$.

Proof. It is well-known that each number $z = ax + by$ is a multiple of $g =
\gcd(a, b)$. Let $a' = a/g$ and $b' = b/g$. Then $\gcd(a', b') = 1$. The largest integer
that cannot be represented as $a'x + b'y$, with $x, y \ge 0$, is $a'b' - (a' + b')$ (see,
e.g., [?]). By just multiplying by $g$, we get that the largest multiple of $g$ that
cannot be written as $ax + by$, with $x, y \ge 0$, is $\mathrm{lcm}(a, b) - (a + b)$. □
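
Lemma 1 is easy to confirm numerically for small values. The following Python sketch is a brute-force check, not a proof; the function names and the scanning window (multiples of the gcd from -gcd up to twice the lcm) are choices made for this illustration.

```python
from math import gcd

def representable(n, a, b):
    """True if n = a*x + b*y for some integers x, y >= 0."""
    return any((n - a * x) % b == 0 for x in range(n // a + 1))

def largest_unrepresentable_multiple(a, b):
    """Largest multiple of gcd(a, b) in the scanned window that is not of the form a*x + b*y."""
    g = gcd(a, b)
    lcm = a * b // g
    # -g is a multiple of g that is never representable, so the maximum always exists.
    return max(n for n in range(-g, 2 * lcm, g) if not representable(n, a, b))

for a, b in [(3, 5), (4, 6), (6, 9), (2, 4)]:
    lcm = a * b // gcd(a, b)
    print(a, b, largest_unrepresentable_multiple(a, b), lcm - (a + b))
```

For a = 4 and b = 6, for instance, the even numbers 0, 4, 6, 8, 10, ... are all representable and 2 = lcm(4, 6) - (4 + 6) is the largest even number that is not.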

Given an alphabet $\Sigma$, $\Sigma^*$ denotes the set of strings on $\Sigma$. Given a language
$L \subseteq \Sigma^*$, its complement is the set $\Sigma^* - L$. A language $L$ is
said to be unary (or tally) whenever it can be built over a single letter alphabet.
In this case, we let $L \subseteq 1^*$.

The computational models we will consider in this paper are one-way
deterministic finite automata (dfa) defined over a one letter input alphabet $\Sigma = \{1\}$.
A dfa will be denoted as a 4-tuple $A = (Q, \delta, q_0, F)$, with the usual meaning (see,
e.g., [?]).

It is not difficult to observe that the transition graph of a unary dfa consists
of a path, which starts from the initial state, followed by a cycle of one or more
states. All automata we will consider are complete and connected. As in [?], the
size of a dfa $A$ is the pair $(\lambda, \mu)$, where $\lambda \ge 1$ and $\mu \ge 0$ denote the number of
states in the cycle and in the path, respectively. Throughout the paper, we will
use the following convention to denote any unary automaton $A = (Q, \delta, q_0, F)$ of
size $(\lambda, \mu)$: the set of states is denoted as $Q = \{q_0, q_1, \ldots, q_{\mu-1}, p_0, p_1, \ldots, p_{\lambda-1}\}$,
where $q_0, q_1, \ldots, q_{\mu-1}$ are the states on the path, and $p_0, p_1, \ldots, p_{\lambda-1}$ are the
states on the cycle (with $q_0 = p_0$ when $\mu = 0$); then $\delta(q_i, 1) = q_{i+1}$, for $i =
0, \ldots, \mu - 2$, $\delta(q_{\mu-1}, 1) = p_0$, and $\delta(p_i, 1) = p_{(i+1) \bmod \lambda}$, for $i = 0, \ldots, \lambda - 1$. A
unary dfa of size $(\lambda, \mu)$ is represented in Figure 1.
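
With this convention, the state reached on input $1^n$ is $q_n$ when $n < \mu$ and $p_{(n-\mu) \bmod \lambda}$ otherwise, so membership can be decided arithmetically. The Python sketch below illustrates this; the function name `accepts`, the encoding of final states as two index sets, and the sample automaton are assumptions for the example.

```python
def accepts(n, lam, mu, final_path, final_cycle):
    """Does the unary dfa of size (lam, mu) accept the string 1^n?

    final_path:  indices i such that the path state q_i is final
    final_cycle: indices i such that the cycle state p_i is final
    """
    if n < mu:
        return n in final_path               # the computation stops in q_n
    return (n - mu) % lam in final_cycle     # the computation stops in p_{(n - mu) mod lam}

# Hypothetical automaton of size (3, 2) whose only final state is p_2.
accepted = [n for n in range(12) if accepts(n, 3, 2, set(), {2})]
print(accepted)  # [4, 7, 10]
```

The accepted lengths form an arithmetic progression of difference 3 starting at 4, i.e., an ultimately periodic set.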

Observing the form of unary dfa’s, it is not difficult to conclude that unary 
regular languages correspond to ultimately periodic sets of integers. More pre- 
cisely: 

Theorem 1. Given a unary language $L$ and two integers $\lambda \ge 1$, $\mu \ge 0$, the
following statements are equivalent:

(i) $L$ is accepted by a dfa of size $(\lambda, \mu)$;

(ii) for any $n \ge \mu$, $1^n \in L$ if and only if $1^{n+\lambda} \in L$.

Fig. 1. A unary dfa of size $(\lambda, \mu)$

A unary dfa is said to be cyclic if and only if its graph is a cycle. Languages
accepted by cyclic automata are said to be cyclic languages. In other words, a
unary language is cyclic if and only if it can be accepted by a dfa of size $(\lambda, 0)$,
for some $\lambda \ge 1$. To emphasize the periodicity of $L$, we say that $L$ is $\lambda$-cyclic.
The following property can be easily proved by using Lemma 1.

Theorem 2. If a unary language $L$ is both $\lambda'$-cyclic and $\lambda''$-cyclic, for some
$\lambda', \lambda'' \ge 1$, then $L$ is $\gcd(\lambda', \lambda'')$-cyclic, too.

In order to show the optimality of the constructions we will give, we now
present a condition which characterizes minimal unary dfa’s (see also [?, Lemma
1]):

Theorem 3. A unary dfa $A = (Q, \delta, q_0, F)$ of size $(\lambda, \mu)$ is minimum if and
only if both the following conditions are satisfied:

(i) for any maximal proper divisor $d$ of $\lambda$ (i.e., $\lambda = \alpha \cdot d$, for some prime
number $\alpha > 1$) there exists an integer $h$, with $0 \le h < \lambda$, such that
$p_h \in F$ if and only if $p_{(h+d) \bmod \lambda} \notin F$, i.e., $1^{\mu+h} \in L$ if and only if
$1^{\mu+h+d} \notin L$;

(ii) $q_{\mu-1} \in F$ if and only if $p_{\lambda-1} \notin F$, i.e., $1^{\mu-1} \in L$ if and only if $1^{\mu+\lambda-1} \notin L$.
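
The two conditions of this characterization are directly checkable. The Python sketch below tests them for an automaton given by its size and its sets of final indices; the function names, the encoding, and the treatment of condition (ii) as vacuous when the path is empty are assumptions made for this illustration.

```python
def prime_factors(n):
    """Set of prime divisors of n."""
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

def is_minimum(lam, mu, final_path, final_cycle):
    """Check conditions (i) and (ii) for a unary dfa of size (lam, mu)."""
    # (i) for each maximal proper divisor d, some h must distinguish p_h from p_{(h+d) mod lam}.
    for alpha in prime_factors(lam):
        d = lam // alpha
        if all((h in final_cycle) == ((h + d) % lam in final_cycle) for h in range(lam)):
            return False
    # (ii) q_{mu-1} and p_{lam-1} must differ in finality (assumed vacuous when mu == 0).
    if mu > 0 and ((mu - 1) in final_path) == ((lam - 1) in final_cycle):
        return False
    return True

print(is_minimum(4, 2, set(), {3}))   # True: cycle has true period 4, path cannot be shortened
print(is_minimum(4, 0, set(), {1, 3}))  # False: the cycle already has period 2
print(is_minimum(2, 1, {0}, {1}))     # False: the path state can be folded into the cycle
```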



Example 1. In order to prove the optimality of several results presented in the
following sections, we will make use of the language $L = 1^{\mu+\lambda-1}(1^{\lambda})^*$, where
$\mu \ge 0$ and $\lambda \ge 1$. Using Theorem 3 it is easy to prove that the size of the
minimum dfa $A$ accepting $L$ is $(\lambda, \mu)$. In particular, the only final state of $A$ is
$p_{\lambda-1}$.
The following result, which gives a tight evaluation of the state complexity
of the union and intersection of unary regular languages, was recently proved by
Shallit [?]:

Theorem 4. Let $L'$ and $L''$ be two languages accepted by unary automata $A'$
and $A''$ of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively. The intersection (the union,
respectively) of $L'$ and $L''$ is accepted by a dfa of size $(\mathrm{lcm}(\lambda', \lambda''), \max(\mu', \mu''))$.
Furthermore, for any $\lambda', \lambda'' \ge 1$, $\mu', \mu'' \ge 0$, there exists a pair of languages
$L', L''$ witnessing the optimality of this bound.
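
The upper bound of Theorem 4 corresponds to a simple construction: take a cycle of length lcm and a path of length max, and mark each state according to the two given automata. The Python sketch below illustrates it; the tuple encoding (cycle length, path length, final path indices, final cycle indices) and the function names are assumptions for the example.

```python
from math import gcd

def member(n, lam, mu, path, cycle):
    """Does the unary dfa (lam, mu, path, cycle) accept 1^n?"""
    return n in path if n < mu else (n - mu) % lam in cycle

def combine(A, B, op):
    """Automaton of size (lcm, max) for the Boolean combination op of A and B."""
    lam = A[0] * B[0] // gcd(A[0], B[0])
    mu = max(A[1], B[1])
    path = {n for n in range(mu) if op(member(n, *A), member(n, *B))}
    cycle = {h for h in range(lam) if op(member(mu + h, *A), member(mu + h, *B))}
    return lam, mu, path, cycle

A = (2, 1, set(), {0})      # hypothetical: odd lengths
B = (3, 0, set(), {0})      # hypothetical: multiples of 3
U = combine(A, B, lambda x, y: x or y)
I = combine(A, B, lambda x, y: x and y)
print(U[:2], I[:2])         # both have size (6, 1)
```

The construction only gives the upper bound; the resulting automaton need not be minimum for a particular pair of languages.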



3 Concatenation 

Given two dfa’s with $m$ and $n$ states, accepting two languages $L'$ and $L''$,
respectively, the concatenation $L = L'L''$ is accepted by a dfa with $m2^n - 2^{n-1}$ states
[?]. This result cannot be improved if the input alphabet contains at least three
symbols. However, in the unary case, the number of states which are sufficient
to recognize $L'L''$ reduces to $mn$. This number is also necessary, in the worst
case, when $m$ and $n$ are relatively prime. As for the intersection, asymptotically,
the worst case state complexity remains the same, even when $m$ and $n$ are not
required to be relatively prime [?].

In this section, we further analyze the state complexity of the concatenation 
in the unary case, by evaluating the optimal size of an automaton accepting the 
concatenation of the languages accepted by two given unary dfa’s. Moreover, we 
are able to show that, for some subclasses of unary regular languages, the size 
of the resulting automaton can be further reduced. 

We start by considering two particular cases: 

Theorem 5. Given $\lambda', \lambda'' \ge 1$, $\mu', \mu'' \ge 0$, let $L'$ and $L''$ be unary languages
accepted by two dfa’s $A'$ and $A''$ of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively.

(i) If $L''$ is finite then $L'L''$ is accepted by a dfa of size $(\lambda', \mu' + \mu'' - 1)$.

(ii) If both languages $L'$ and $L''$ are cyclic then $L'L''$ is accepted by a dfa of
size $(\gcd(\lambda', \lambda''), \mathrm{lcm}(\lambda', \lambda'') - 1)$.

Proof. If $L''$ is finite, then any string in $L''$ has length less than $\mu''$. Thus, given
an integer $n \ge \mu' + \mu'' - 1$ such that $1^n \in L'L''$, there are two integers $x$ and $y$
such that $n = x + y$, $1^x \in L'$, $1^y \in L''$, $y < \mu''$, and $x \ge \mu'$. Since $L'$ is accepted by
a dfa of size $(\lambda', \mu')$, this implies that even $1^{x+\lambda'} \in L'$. Hence, $1^{n+\lambda'} \in L'L''$. By
similar arguments, we can also prove that, for any $n \ge \mu' + \mu'' - 1$, $1^{n+\lambda'} \in L'L''$
implies that $1^n \in L'L''$. Thus, in the light of Theorem 1, we conclude that $L'L''$
is accepted by an automaton of size $(\lambda', \mu' + \mu'' - 1)$.

If both $L'$ and $L''$ are cyclic, then there exist two sets $Z' \subseteq \{0, \ldots, \lambda' - 1\}$,
$Z'' \subseteq \{0, \ldots, \lambda'' - 1\}$, such that

$L' = \{1^{z'+\lambda' k} \mid z' \in Z' \text{ and } k \ge 0\}$ and $L'' = \{1^{z''+\lambda'' j} \mid z'' \in Z'' \text{ and } j \ge 0\}$.

In the light of Theorem 1, in order to prove that $L'L''$ is accepted by a dfa
of size $(\gcd(\lambda', \lambda''), \mathrm{lcm}(\lambda', \lambda'') - 1)$, it is enough to show that, for any integer
$z \ge \mathrm{lcm}(\lambda', \lambda'') - 1$, $1^z \in L'L''$ if and only if $1^{z+\gcd(\lambda', \lambda'')} \in L'L''$.

To this aim, consider an integer $w \ge \mathrm{lcm}(\lambda', \lambda'') - 1$ with $1^w \in L'L''$, and
integers $z' \in Z'$, $z'' \in Z''$, $i, j \ge 0$, such that $w = z' + z'' + \lambda' i + \lambda'' j$. Since
$z' \le \lambda' - 1$ and $z'' \le \lambda'' - 1$, the number $\lambda' i + \lambda'' j$ is a multiple of
$\gcd(\lambda', \lambda'')$ that is at least $\mathrm{lcm}(\lambda', \lambda'') - (\lambda' + \lambda'') + 1$. By Lemma 1,
it turns out that even $\lambda' i + \lambda'' j + \gcd(\lambda', \lambda'')$ can be represented as $\lambda' x + \lambda'' y$,
for some $x, y \ge 0$. Thus, it is easy to conclude that $1^{w+\gcd(\lambda', \lambda'')} \in L'L''$.

Using similar arguments, it is possible to show that $1^{w+\gcd(\lambda', \lambda'')} \in L'L''$
implies $1^w \in L'L''$. □
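
The cyclic case can be checked by brute force on a concrete pair of languages. In the Python sketch below, cyclic languages are encoded as a period together with a set of residues; the encoding, the function names, and the sample pair (periods 4 and 6) are assumptions for illustration, and the check covers only a finite window of lengths.

```python
from math import gcd

def in_cyclic(n, lam, residues):
    """Membership of 1^n in the lam-cyclic language with the given residue set."""
    return n % lam in residues

def in_concat(n, A, B):
    """Membership of 1^n in the concatenation, by trying every split n = x + (n - x)."""
    return any(in_cyclic(x, *A) and in_cyclic(n - x, *B) for x in range(n + 1))

A, B = (4, {3}), (6, {5})              # hypothetical cyclic languages
g = gcd(A[0], B[0])                    # 2
lcm = A[0] * B[0] // g                 # 12
start = lcm - 1                        # claimed length of the initial path
periodic = all(in_concat(z, A, B) == in_concat(z + g, A, B)
               for z in range(start, start + 60))
print(periodic)                                        # True
print(in_concat(start - 1, A, B), in_concat(start - 1 + g, A, B))  # False True
```

For this pair the period gcd = 2 holds from length lcm - 1 = 11 onwards, and it fails one step earlier (length 10 is rejected while length 12 is accepted), so the path length cannot be reduced here.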

Now, we prove that the results stated in Theorem 5 are optimal. For the case
of $L''$ finite, the following result can be easily proved:

Theorem 6. Given some integers $\mu' \ge 0$, $\mu'', \lambda', \lambda'' \ge 1$, the infinite language
$L' = 1^{\mu'+\lambda'-1}(1^{\lambda'})^*$ and the finite language $L'' = 1^{\mu''-1}$ are accepted by two dfa’s
of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively. Moreover, the size of the minimum dfa
accepting the concatenation of $L'$ and $L''$ is $(\lambda', \mu' + \mu'' - 1)$.

The optimality of statement (ii) of Theorem 5 is a consequence (in the case
$\mu' = \mu'' = 0$) of the following result:

Theorem 7. Let $\mu', \mu'' \ge 0$, $\lambda', \lambda'' \ge 1$ be integer numbers. The languages

$L' = 1^{\mu'+\lambda'-1}(1^{\lambda'})^*$ and $L'' = 1^{\mu''+\lambda''-1}(1^{\lambda''})^*$

are accepted by two dfa’s of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively. Moreover, the
size of the minimum dfa accepting $L'L''$ is $(\gcd(\lambda', \lambda''), \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1)$.

Proof. To see that languages $L'$ and $L''$ are accepted by two dfa’s of size $(\lambda', \mu')$
and $(\lambda'', \mu'')$, consider Example 1.

For the sake of simplicity, we will compute the size of the minimum dfa
accepting $L'L''$ in the case $\mu' = \mu'' = 0$. The extension to the general case is
trivial.

By Theorem 5(ii), there is an automaton $A$ of size $(\gcd(\lambda', \lambda''), \mathrm{lcm}(\lambda', \lambda'') - 1)$
accepting $L = L'L'' = \{1^{\lambda' x + \lambda'' y - 2} \mid x, y \ge 1\}$. Using Theorem 3 we now
show that $A$ is minimum.

Since all numbers of the form $\lambda' x + \lambda'' y$, with $x, y \ge 0$, are multiples of
$\gcd(\lambda', \lambda'')$, the difference between the lengths of two strings in $L$ is a multiple
of $\gcd(\lambda', \lambda'')$. Thus, the cycle of $A$, whose length is just $\gcd(\lambda', \lambda'')$, cannot
contain more than one final state. This implies that condition (i) of Theorem 3
holds.

To show that even condition (ii) is satisfied, we now prove that
$1^{\mathrm{lcm}(\lambda', \lambda'') - 2} \notin L$, while $1^{\mathrm{lcm}(\lambda', \lambda'') - 2 + \gcd(\lambda', \lambda'')} \in L$. To this aim, we
consider numbers $m$ of the form $m = \mathrm{lcm}(\lambda', \lambda'') - 2 + k \gcd(\lambda', \lambda'')$, with
$k \ge 0$. The string $1^m$ belongs to $L$ if and only if there are two integers $x, y \ge 1$
such that $m = \lambda' x + \lambda'' y - 2$, i.e., if and only if there are $x, y \ge 0$ such that
$\lambda' x + \lambda'' y = \mathrm{lcm}(\lambda', \lambda'') - (\lambda' + \lambda'') + k \gcd(\lambda', \lambda'')$. By Lemma 1, this
condition is satisfied if and only if $k \ge 1$. This implies that even condition (ii)
of Theorem 3 holds. □
Now, we consider the general case:

Theorem 8. Given $\mu', \mu'' \ge 0$, $\lambda', \lambda'' \ge 1$, let $L'$ and $L''$ be unary languages
accepted by two automata $A'$ and $A''$ of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively.
Then, the concatenation of $L'$ and $L''$ is accepted by a dfa of size $(\lambda, \mu)$, where
$\lambda = \mathrm{lcm}(\lambda', \lambda'')$ and $\mu = \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1$.

Proof (outline). Let $X'$, $X''$ be the languages accepted by the initial paths of
$A'$, $A''$, i.e., $X' = L' \cap \{1^x \mid 0 \le x < \mu'\}$, $X'' = L'' \cap \{1^x \mid 0 \le x < \mu''\}$, and $Y'$,
$Y''$ the languages accepted by restricting automata $A'$, $A''$ to their cyclic parts.
Hence, $L' = X' \cup 1^{\mu'} Y'$ and $L'' = X'' \cup 1^{\mu''} Y''$. The product $L$ of $L'$ and $L''$ can
be expressed as:

$$L = X'X'' \cup 1^{\mu'} Y' X'' \cup 1^{\mu''} X' Y'' \cup 1^{\mu'+\mu''} Y' Y''. \qquad (1)$$

Since the languages $X'$, $X''$, $Y'$, $Y''$, $1^{\mu'}$, and $1^{\mu''}$ can be accepted by dfa’s of
size $(1, \mu')$, $(1, \mu'')$, $(\lambda', 0)$, $(\lambda'', 0)$, $(1, \mu' + 1)$ and $(1, \mu'' + 1)$, respectively, using
Theorem 4 and Theorem 5 it is not difficult to conclude that $L$ is accepted by
a dfa of size $(\lambda, \mu)$, where $\lambda = \mathrm{lcm}(\lambda', \lambda'')$ and $\mu = \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1$. □

We now study the optimality of the result stated in Theorem 8. First, we
prove that this result is optimal when $\gcd(\lambda', \lambda'') > 1$. Subsequently, we will
consider relatively prime $\lambda'$ and $\lambda''$; we will show that in this case the number
of states in the cyclic part can be further reduced.

Let us start by proving the following result:

Theorem 9. For any $\mu', \mu'' \ge 2$, $\lambda', \lambda'' \ge 2$, such that $\gcd(\lambda', \lambda'') > 1$, there
exist two unary languages $L'$ and $L''$ which are accepted by two automata $A'$
and $A''$ of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively, such that the minimum dfa
accepting the concatenation of $L'$ and $L''$ has size $(\lambda, \mu)$, where $\lambda = \mathrm{lcm}(\lambda', \lambda'')$ and
$\mu = \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1$.

Proof. If A" divides A' (A' divides A", respectively), then the result is a conse- 
quence of Theorem El Now, suppose that A' does not divide A" and A" does not 
divide A', and consider the languages: 

L' = iF+Y-lf^lYy y lF-2 ^// ^ 

It is not difficult to describe two automata $A'$ and $A''$ of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$
accepting $L'$ and $L''$, respectively. From these automata, according to Theorem
8, an automaton $A$ of size $(\lambda, \mu)$ accepting $L$ can be obtained. We observe that
a state $p_x$ on the cycle of $A$, with $0 \le x < \lambda$, is final if and only if there is an
integer $k \ge 1$ such that either $x = kg - 1$, or $x = k\lambda' - 2$, or $x = k\lambda'' - 2$, where
$g = \gcd(\lambda', \lambda'')$.

In order to show that $A$ is minimum, we prove that both conditions (i) and
(ii) of Theorem [?] are satisfied.

Consider a maximal proper divisor $d$ of $\lambda$. Then, either $\lambda'$ divides $d$, or $\lambda''$
divides $d$. Suppose that $\lambda'$ divides $d$, i.e., $d = \beta\lambda'$, for some $\beta \ge 1$, and consider
$h = \lambda'' - 2$. Then $h + d = \lambda'' - 2 + \beta\lambda'$. So, $p_{(h+d) \bmod \lambda} \in F$ if and only if
there exists an integer $k \ge 1$ such that either (a) $\lambda'' - 2 + \beta\lambda' = kg - 1$, or (b)
$\lambda'' - 2 + \beta\lambda' = k\lambda' - 2$, or (c) $\lambda'' - 2 + \beta\lambda' = k\lambda'' - 2$.

Since $g$ is greater than 1 and divides both $\lambda'$ and $\lambda''$, the equality (a), which
reduces to $\lambda'' + \beta\lambda' = kg + 1$, cannot be satisfied for any integer $k$. Equality
(b) cannot be satisfied either, since it implies that $\lambda'$ divides $\lambda''$. Finally, equality (c)
reduces to $\beta\lambda' = (k - 1)\lambda''$; thus, it implies that $\beta\lambda'$, namely $d$, is a multiple
of both $\lambda'$ and $\lambda''$, i.e., a multiple of $\lambda$. This is a contradiction. Thus, we are
able to conclude that $p_{(h+d) \bmod \lambda} \notin F$, while $p_h \in F$. The case of $\lambda''$ which
divides $d$ can be managed in a similar way. Hence, condition (i) of Theorem [?]
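is satisfied. Using Lemma [?] it is possible to verify that $1^{\mu'+\mu''+\mathrm{lcm}(\lambda',\lambda'')-2} \notin L$
while $1^{\mu'+\mu''+\mathrm{lcm}(\lambda',\lambda'')+\lambda-2} \in L$. Hence, even condition (ii) of Theorem [?] holds.

This implies that $A$ is minimum. □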

Theorem 9 shows the optimality of the result stated in Theorem 8 for all
$\mu'$, $\mu''$, $\lambda'$, $\lambda''$ such that $\gcd(\lambda', \lambda'') > 1$, with few exceptions for small $\mu'$, $\mu''$.

We now consider the case of relatively prime $\lambda'$ and $\lambda''$. The number of states
in the cyclic part of the minimum dfa accepting the product of $L'$ and $L''$ is less
than $\mathrm{lcm}(\lambda', \lambda'')$. In particular, if both languages are infinite, then this number
reduces to 1, while if $L''$ is finite it reduces to $\lambda'$:

Theorem 10. Let $L'$ and $L''$ be unary languages accepted by two automata $A'$
and $A''$ of size $(\lambda', \mu')$, $(\lambda'', \mu'')$, respectively, with $\mu', \mu'' > 0$, $\lambda', \lambda'' > 1$, such
that $\gcd(\lambda', \lambda'') = 1$.

If both $L'$ and $L''$ are infinite, then their concatenation is accepted by an
automaton $A$ of size $(1, \mu' + \mu'' + \lambda'\lambda'' - 1)$; if $L''$ is finite, then the concatenation
of $L'$ and $L''$ is accepted by an automaton of size $(\lambda', \mu' + \mu'' - 1)$.

These results are optimal, with the only exception of the trivial case $L'' = \emptyset$.

Proof. Suppose that both $L'$ and $L''$ are infinite. The concatenation $L$ of $L'$ and
$L''$ can be expressed as in equality (1) (proof of Theorem 8). Since $\gcd(\lambda', \lambda'') = 1$,
any string $1^x$ with $x \ge \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1$ belongs to $1^{\mu'+\mu''} Y' Y''$ and
then to $L$. Hence, it is possible to conclude that a cycle of length 1 is sufficient.
The optimality is a consequence of Theorem [?].

When $L''$ is finite, the result is an immediate consequence of Theorem [?] and
of Theorem [?]. □

Some Particular Cases 

In the proof of Theorem 8 we have outlined the construction of an automaton
$A$ accepting the concatenation $L$ of two languages $L'$ and $L''$ accepted by two
given unary dfa's $A'$ and $A''$. To get the automaton $A$, in equality (1) we have
expressed the language $L$ as the union of some languages which are obtained by
combining the cyclic and the noncyclic parts of $L'$ and $L''$. When one or both
noncyclic parts are empty, some of the languages on the right side of (1) are
empty. Thus, evaluating in these cases the size of the resulting automata, one
can easily get the following result:

Theorem 11. Let $L'$ and $L''$ be unary languages accepted by two automata $A'$
and $A''$ of size $(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively. The concatenation of $L'$ and
$L''$ is accepted by a dfa of size $(\lambda, \mu)$, where $\mu = \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1$ and:

(i) if the initial path of $A'$ does not contain any final state, then $\lambda$ can be
taken equal to $\lambda'$;

(ii) if the initial path of $A''$ does not contain any final state, then $\lambda$ can be
taken equal to $\lambda''$;

(iii) if both the initial paths do not contain any final state, then $\lambda$ can be taken
equal to $\gcd(\lambda', \lambda'')$.

We point out that even the results stated in Theorem 11 are optimal when
$\gcd(\lambda', \lambda'') > 1$. For statement (iii), this is a consequence of Theorem [?]. This
also implies the optimality of (ii) when $\lambda'' = \gcd(\lambda', \lambda'') = 2$. On the other hand,
for $\lambda'' > 2$, the optimality of (ii) is given by the following result, whose proof is
similar to that of Theorem [?].

Theorem 12. Given $\mu', \lambda' > 2$, $\lambda'' > 2$, $\mu'' > 0$ such that $\gcd(\lambda', \lambda'') > 1$,
consider the languages

L' = u iJ' — +\" -1 Y 

The languages $L'$ and $L''$ can be accepted by two automata $A'$ and $A''$ of size
$(\lambda', \mu')$ and $(\lambda'', \mu'')$, respectively. Furthermore, the minimum dfa accepting the
concatenation $L = L'L''$ has size $(\lambda'', \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1)$.

We point out that the optimality of Theorem [?] for $\gcd(\lambda', \lambda'') > 1$ can
be shown in a similar way.

We conclude this section by summarizing, in Table 1 and in Table 2, the
results we have proved concerning the state complexity of the concatenation of
two languages $L'$ and $L''$ accepted by two automata $A'$ and $A''$ of size $(\lambda', \mu')$
and $(\lambda'', \mu'')$, respectively.



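Table 1. State complexity of the concatenation, when $\gcd(\lambda', \lambda'') > 1$. $X'$ and $X''$
denote the languages accepted by the initial paths of $A'$ and $A''$, respectively, i.e.,
$X' = \{1^x \in L' \mid x < \mu'\}$ and $X'' = \{1^x \in L'' \mid x < \mu''\}$.

Case                                        | Size                                                                            | Upper bound | Lower bound
--------------------------------------------+---------------------------------------------------------------------------------+-------------+--------------------
$X' \neq \emptyset$, $X'' \neq \emptyset$   | $(\mathrm{lcm}(\lambda', \lambda''),\ \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1)$ | Th. 8       | Th. 9
$X' \neq \emptyset$, $X'' = \emptyset$      | $(\lambda'',\ \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1)$            | Th. 11      | Th. [?] and Th. [?]
$X' = \emptyset$, $X'' \neq \emptyset$      | $(\lambda',\ \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1)$             | Th. 11      | Th. [?] and Th. [?]
$X' = \emptyset$, $X'' = \emptyset$         | $(\gcd(\lambda', \lambda''),\ \mu' + \mu'' + \mathrm{lcm}(\lambda', \lambda'') - 1)$ | Th. 11      | Th. [?]


Table 2. State complexity of the concatenation, when $\gcd(\lambda', \lambda'') = 1$.

Case                                  | Size                                          | Upper bound | Lower bound
--------------------------------------+-----------------------------------------------+-------------+------------
$\#L' = \infty$, $\#L'' = \infty$     | $(1,\ \mu' + \mu'' + \lambda'\lambda'' - 1)$  | Th. 10      | Th. 10
$\#L' < \infty$, $\#L'' = \infty$     | [?]                                           | Th. [?]     | Th. [?]
$\#L' = \infty$, $\#L'' < \infty$     | $(\lambda',\ \mu' + \mu'' - 1)$               | Th. [?]     | Th. [?]
$\#L' < \infty$, $\#L'' < \infty$     | $(1,\ \mu' + \mu'' - 1)$                      | Th. [?]     | trivial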


4 Star 

We conclude the paper by presenting, in this section, some considerations con- 
cerning the state complexity of the star operation in the unary case. 

First of all, we recall the following result [?, Th. 5.3]:

Theorem 13. If $L$ is a unary regular language accepted by an $n$-state dfa, then
$L^*$ is accepted by a dfa with $(n - 1)^2 + 1$ states. Furthermore, for any $n > 1$ this
result cannot be improved.

We observe that if $L$ is accepted by an automaton of size $(\lambda, \mu)$, then the
cycle in the minimum dfa accepting $L^*$ cannot have more than $\lambda$ states, i.e., the
size $(\lambda^*, \mu^*)$ of the minimum automaton accepting $L^*$ satisfies $\lambda^* \le \lambda$.

We now analyze some limit situations. 

First, we suppose that $\mu = 0$, i.e., $L$ is $\lambda$-cyclic. If $L = (1^\lambda)^*$, then $L = L^*$
and $(\lambda^*, \mu^*) = (\lambda, 0)$. Otherwise, let $k$ be an integer such that $1^k \in L$ and
$0 < k < \lambda$. Any string having length of the form $kx + \lambda y$, with $x \ge 1$ and $y \ge 0$,
or $x = y = 0$, belongs to $L^*$. Hence, the length of the loop is $\lambda^* \le \gcd(\lambda, k) \le k$.
In particular, when $L = 1^k(1^\lambda)^*$, by using Lemma [?] it is not difficult to conclude
that $\lambda^* = \gcd(\lambda, k)$ and $\mu^* = \mathrm{lcm}(k, \lambda) - \lambda + 1$. For $k = \lambda - 1$ this reduces to
$\lambda^* = 1$ and $\mu^* = (\lambda - 1)^2$, which exactly matches the number of states given in
Theorem 13.
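
The values $\lambda^* = \gcd(\lambda, k)$ and $\mu^* = \mathrm{lcm}(k, \lambda) - \lambda + 1$ for $L = 1^k(1^\lambda)^*$ can be checked by brute force. The Python sketch below is an illustration added here, not the paper's construction; the function name `star_size` is hypothetical. It uses the bound of $(n - 1)^2 + 1$ states (with $n = \lambda$) only to decide how long a prefix of $L^*$ has to be inspected, and it assumes $0 < k < \lambda$.

```python
def star_size(lam, k):
    """Size (lam*, mu*) of the minimum dfa for L*, where L = 1^k (1^lam)^*."""
    bound = (lam - 1) ** 2 + 1          # state bound for an n-state dfa, n = lam
    n = 3 * bound
    in_l = [x >= k and (x - k) % lam == 0 for x in range(n)]
    star = [True] + [False] * (n - 1)   # the empty word belongs to L*
    for x in range(1, n):
        # 1^x is in L* iff it is a word of L followed by a shorter word of L*
        star[x] = any(in_l[i] and star[x - i] for i in range(1, x + 1))
    # smallest period on a window lying beyond any possible initial path
    lam_star = next(p for p in range(1, bound + 1)
                    if all(star[x] == star[x + p] for x in range(bound, 2 * bound)))
    # shorten the initial path for as long as the periodicity still holds
    mu_star = bound
    while mu_star > 0 and star[mu_star - 1] == star[mu_star - 1 + lam_star]:
        mu_star -= 1
    return lam_star, mu_star
```

With $\lambda = 4$ and $k = 3$ the call `star_size(4, 3)` covers the extremal case $k = \lambda - 1$, and `star_size(6, 4)` a case with $\gcd(\lambda, k) = 2$.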

Now, we suppose that $\mu > 0$ and $\lambda = 1$. If $p_\mu \in F$ then all strings of length
greater than $\mu - 1$ belong to $L$. Thus $L^*$ is accepted by an automaton of size
$(1, \mu^*)$, with $\mu^* \le \mu$. On the other hand, if $p_\mu \notin F$ then $L$ is finite. This case
was analyzed in [?], proving that $L^*$ is accepted by an automaton with at most
$n^2 - 7n + 13$ states, where $n = \mu + \lambda = \mu + 1$ (this result, which is optimal, holds
for $n > 4$ and for $n = 3$). We can suppose that $1^{\mu-1} \in L$, otherwise the size of
the dfa accepting $L$ can be reduced. If $L = \{1^{\mu-1}\}$ then $L^*$ is accepted by an
automaton of size $(\mu - 1, 0)$. If $L = \{1^s, 1^{\mu-1}\}$, with $0 < s < \mu - 1$, then, using
Lemma [?] it can be shown that $L^*$ is accepted by an automaton of size $(\lambda^*, \mu^*)$,
with $\lambda^* = \gcd(\mu - 1, s)$ and $\mu^* = \mathrm{lcm}(\mu - 1, s) - \mu - s + 2$. In particular, when
$s = \mu - 2$, we get $\lambda^* = 1$ and $\mu^* = \mu^2 - 5\mu + 6$ (note that for $n = \lambda + \mu$, $\lambda^* + \mu^*$
is exactly $n^2 - 7n + 13$). As pointed out in [?], the reader can verify that the size
obtained in the last case is an upper limit for the case of $L$ containing three or
more words.


Abstract. An implementation of a recent algorithm (Vöge, Jurdziński,
to appear in CAV 2000) for the construction of winning strategies in
infinite games is presented. The games under consideration are “finite-state
parity games”, i.e. games over finite graphs where the winning condition
is inherited from $\omega$-automata with parity acceptance. The emphasis of
the paper is the development of a user interface which supports the
researcher in case studies for algorithms of $\omega$-automata theory. Examples
of such case studies are provided which might help in evaluating the (so
far open) asymptotic runtime of the presented algorithm.



1 Introduction 

The algorithmic approach to automata theory, which has been so successful in
the theory of automata over finite words, has not yet produced results of similar
strength in the theory of $\omega$-automata, i.e., automata over infinite sequences.
Although many intriguing and mathematically powerful constructions exist in
$\omega$-automata theory, the high complexity of most algorithms has so far prevented
practical applications. Examples of such constructions are the Safra construction
for determinizing $\omega$-automata, the complementation of Büchi automata, and the
construction of winning strategies in infinite games.

The package OMEGA of algorithms in $\omega$-automata theory, presently under
development at RWTH Aachen, should support the researcher in getting
experience with the performance of these algorithms and in carrying out case studies.
In the present paper we address a problem from $\omega$-automata theory which
belongs to the theory of infinite games: the construction of winning strategies in
parity games. These games have a direct connection to the model-checking
problem for the modal $\mu$-calculus (cf. [?]). One of the central open questions in the
field asks whether this model-checking problem or, equivalently, the construction
of winning strategies in parity games is possible in polynomial time.

In [?] a new algorithm was developed, which has the potential of being more
efficient than previously known procedures (e.g., [?]). The new algorithm
follows the idea of strategy improvement as devised by Puri [?], but now in a
completely discrete setting rather than over a dense value domain. The algorithm
is hard to analyze; indeed, it is open whether the running time of this concrete
algorithm can be bounded by a polynomial. Furthermore, the execution of the
algorithm is hard to follow since it induces in each step a global change over the
given game graph. In this situation, it is helpful to study the algorithm with an
implementation and some auxiliary functions which help in assessing the main
characteristics of its behaviour.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 263-[?], 2001.

In the present paper we report on this implementation work, and on the user
interface, and we discuss the output of the program in some simple example
cases. In this way, we prepare a platform also for model-checking in the full
$\mu$-calculus (this issue is, however, not treated in the present work).

The paper is structured as follows: We present, in section 2, a short
outline of the algorithm (for more detailed pseudocode see [?]) and describe some
improvements which make an implementation more efficient. Based on these
considerations, we have implemented the algorithm in C.

The focus of this paper is, however, the user’s view of the program and the 
discussion of some case studies. These issues are treated in sections 3 and 4. 

In section 3, we first present a Lisp-based input language for describing large 
game graphs. Secondly, the output format is discussed, including some features 
which support the user in analyzing the performance of the algorithm. Three 
essential parameters are isolated: The number of iterations of the main loop of 
the algorithm, the number of changes of strategy for each vertex, and (regarding 
the global running time) the number of edge traversals in the search procedures 
of the algorithm. For the second parameter a two-dimensional output (in matrix 
format) is developed. 

Section 4 illustrates the use of the program. From the case studies
considered so far we only discuss here an example taken from [?], where a classical
(exponential-time) solution of parity games was given.

2 Parity Games and the Construction of Winning 
Strategies 

A parity game is an infinite game played on a finite colored graph (the “arena”
of the game). The game graph is of the form $G = (V_0, V_1, E, c)$. The set $V$ of
vertices is the union of the two disjoint vertex sets $V_0$ and $V_1$, and we have
$E \subseteq (V_0 \times V_1) \cup (V_1 \times V_0)$, i.e., an edge leads from a $V_0$-vertex to a $V_1$-vertex or
vice versa. We assume $vE \neq \emptyset$ for all vertices $v$.

A play starting from vertex $v$ is an infinite sequence $v_0 v_1 v_2 \ldots$ of vertices
with $(v_i, v_{i+1}) \in E$ for all $i \ge 0$; one imagines players 0 and 1 moving a token
along edges through the graph in a nonterminating sequence of steps. If $v_i \in V_0$
it is the turn of player 0 to decide for an edge leaving $v_i$, otherwise player 1 has
to choose an edge leaving $v_i$.

The winner of a play $v_0 v_1 v_2 \ldots$ is defined with a reference to the component
$c$ of the game graph, which is a function (called coloring) $c : V \to \{0, \ldots, k\}$
for some $k$. The play $v_0 v_1 v_2 \ldots$ is won by player 0 if in the associated sequence
$c(v_0) c(v_1) c(v_2) \ldots$ of colors the maximal color occurring infinitely often is even;
otherwise player 1 wins the play.
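
For a play that eventually repeats a fixed loop forever, the colors occurring infinitely often are exactly the colors on that loop, so the winner can be read off directly. The short Python sketch below illustrates this; the function name and the example coloring are hypothetical and not taken from the paper.

```python
def winner(prefix, loop, color):
    """Winner (0 or 1) of the play prefix . loop . loop . loop ...

    Only the loop matters: the vertices of the prefix are seen finitely often.
    """
    return 0 if max(color[v] for v in loop) % 2 == 0 else 1

# hypothetical coloring of three vertices
color = {"a": 1, "b": 2, "c": 3}
print(winner(["c"], ["a", "b"], color))   # loop colors 1 and 2
print(winner(["a"], ["b", "c"], color))   # loop colors 2 and 3
```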

It is a well-known theorem of the theory of infinite games (see e.g. [?]) that
parity games (even over infinite game graphs) are “determined” in the following
strong sense: From each vertex either player 0 or player 1 can force a win by a
positional winning strategy. More precisely, the vertex set is partitioned in two
winning regions $W_0$, $W_1$ such that player 0 wins starting from any vertex in $W_0$
by a fixed choice of outgoing edges for the $V_0$-vertices in $W_0$, and analogously
player 1 wins starting from any vertex in $W_1$ by a fixed choice of outgoing edges
for the $V_1$-vertices in $W_1$. The “solution of the game” is given by the sets $W_0$, $W_1$
and corresponding edge sets $E_0 \subseteq (V_0 \cap W_0) \times V_1$ and $E_1 \subseteq (V_1 \cap W_1) \times V_0$.
As shown by [?], the problem of solving parity games over finite graphs is in
NP $\cap$ co-NP and polynomially equivalent to the model-checking problem for the
full modal $\mu$-calculus. All known algorithms [?] have an exponential worst
case running time. A key problem of the theory of program verification is to
settle the question whether parity games can be solved in polynomial time.

In [?], a new algorithm was developed, based on the idea of strategy
improvement (which originates in work on stochastic games and mean pay-off games).
The running time of this algorithm can be trivially bounded by the number of 
possible strategies (say of player 0), which gives an exponential upper bound, 
but it is an open question whether this can be sharpened to a polynomial bound. 
This motivates an experimental study of the algorithm as initiated in this paper. 

Let us sketch the algorithm of [?]. The key idea is a discrete valuation of plays
and a process which successively modifies strategies of player 0 to guarantee
higher and higher values of associated plays. To define this valuation of plays,
we use a “relevance order” $<$ of the vertex set, which is a total order refining the
order according to the coloring (so higher colors signal higher relevance). The
vertices of the highest even color are the most valuable for player 0, while the
vertices of the highest odd color are most valuable for the opponent, player 1.
This defines the “reward ordering” $\prec$ (for player 0), obtained by concatenating
the reversed relevance ordering of the odd-colored vertices and the relevance 
ordering of the even-colored vertices. (So a higher position of a vertex in the 
reward ordering signals an advantage for player 0.) 
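
The construction of the reward ordering from the relevance order can be written down in a few lines. The following Python sketch is only an illustration; it assumes the relevance order is given as a list of vertices by increasing relevance, and the vertex names and colors in the example are hypothetical.

```python
def reward_order(relevance, color):
    """Vertices from least to most rewarding for player 0.

    relevance: list of vertices by increasing relevance (refining the colors);
    color: dict mapping each vertex to its color.
    """
    odd = [v for v in relevance if color[v] % 2 == 1]
    even = [v for v in relevance if color[v] % 2 == 0]
    # reversed relevance order of the odd vertices, then the even vertices
    return odd[::-1] + even

# hypothetical example: four vertices with colors 0, 1, 2, 3
color = {"a": 0, "b": 1, "c": 2, "d": 3}
print(reward_order(["a", "b", "c", "d"], color))
```

In the example, the vertex of the highest odd color comes first (worst for player 0) and the vertex of the highest even color comes last (best for player 0).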

Assume a strategy pair $(\sigma, \tau)$ of the two players is given, where $\sigma$ is defined
by the choice of one out-edge for each vertex in $V_0$, similarly $\tau$ by the choice
of one out-edge for each vertex in $V_1$. The strategy pair induces a well-defined
play once an initial vertex is given; moreover, this play will end in a loop (which
is completed when the first repetition of a vertex occurs). Given $(\sigma, \tau)$, each
vertex $v$ will get a value (also called play profile) which is extracted from the
associated play starting in $v$. Given this valuation on the whole game graph (and
an ordering of the value set), a strategy improvement step of player 0 will then
consist in locally redefining the choice of his out-edges: he will pick, for each
vertex $u$, an edge to a neighbour vertex $v$ of highest value.

The play profiles are composed of three components, referring to the relevance
order introduced above. Recall that for a given strategy pair $(\sigma, \tau)$ and vertex
$v$ the resulting play will end in a loop $L$. The three components $r$, $P$, $m$ of the
play profile of $v$ are, in their order of significance, defined as follows: The first
component is the most relevant vertex $r$ of $L$, and called positive if its color is
even, otherwise negative. Similarly, a play profile $(r, P, m)$ is called positive (resp.
negative) iff $r$ is. We call $r$ shortly “the most relevant loop vertex associated to
$v$”. The second component $P$ of the play profile is the set of vertices more relevant
than $r$ and visited before entering the loop $L$ (from $v$), and the third component
is the number $m$ of vertices visited before reaching $r$.
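
The three components can be computed by simply following the play until a vertex repeats. The Python sketch below is a direct transcription of the definition, not the procedure used in the implementation; it assumes the strategy pair is merged into one successor map and the relevance order is given as a rank per vertex, and all names are illustrative.

```python
def play_profile(v, succ, relevance):
    """Play profile (r, P, m) of vertex v.

    succ: dict vertex -> chosen successor (both players' strategies merged);
    relevance: dict vertex -> rank in the relevance order (higher = more relevant).
    """
    path, seen = [], {}
    while v not in seen:                 # follow the play until a vertex repeats
        seen[v] = len(path)
        path.append(v)
        v = succ[v]
    loop = path[seen[v]:]
    r = max(loop, key=relevance.get)     # most relevant loop vertex
    m = path.index(r)                    # vertices visited before reaching r
    P = frozenset(u for u in path[:m] if relevance[u] > relevance[r])
    return r, P, m

# hypothetical strategy pair on four vertices
succ = {1: 2, 2: 3, 3: 4, 4: 3}
relevance = {1: 1, 2: 4, 3: 2, 4: 3}
print(play_profile(1, succ, relevance))
```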

The play profiles are linearly ordered by an ordering $\prec$ as follows:

$$(r, P, m) \prec (r', P', m') \iff
\begin{cases}
r \prec r' \\
\lor\ (r = r' \land P \prec P') \\
\lor\ (r = r' \land P = P' \land r' \in V_- \land m < m') \\
\lor\ (r = r' \land P = P' \land r' \in V_+ \land m > m')
\end{cases}$$

where $P \prec P'$ holds if there is a most relevant vertex in the symmetric difference
of $P$ and $P'$ that is either positive and in $P'$ or negative and in $P$.
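
The comparison of two play profiles can be sketched as follows. This Python fragment is an illustration under stated assumptions: the first comparison of the most relevant loop vertices is read as the reward ordering, $V_+$ and $V_-$ are taken to be the even-colored and odd-colored vertices, and relevance ranks are positive integers. The function name is hypothetical.

```python
def profile_less(p, q, relevance, color):
    """True if play profile p = (r, P, m) lies strictly below q = (r', P', m')."""
    def positive(v):
        return color[v] % 2 == 0

    def reward(v):
        # positive vertices rank above negative ones; among the negative
        # vertices the less relevant one is better for player 0
        return relevance[v] if positive(v) else -relevance[v]

    (r, P, m), (r2, P2, m2) = p, q
    if r != r2:
        return reward(r) < reward(r2)
    if P != P2:
        w = max(P ^ P2, key=relevance.get)   # most relevant vertex of the symmetric difference
        return (w in P2) if positive(w) else (w in P)
    if m != m2:
        # negative r: a longer way to r is better; positive r: a shorter way is better
        return m > m2 if positive(r) else m < m2
    return False
```

As a usage example with four vertices whose relevance rank equals their color, `profile_less((3, frozenset(), 0), (1, frozenset(), 0), rel, col)` compares two negative loop vertices, where the less relevant one is preferable for player 0.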

The algorithm is outlined in Figure 1.



Strategy Improvement Algorithm

1. Choose an arbitrary strategy $\sigma$ for player 0 (by picking an out-edge
for each vertex in $V_0$).

2. Evaluate this strategy for player 0 (by computing an “optimal
response” strategy $\tau$ of player 1, which induces together with $\sigma$ a
valuation of the game graph with play profiles).

3. By picking a neighbour vertex of highest value (play profile) for each
vertex in $V_0$, construct a new strategy $\sigma$ for player 0 and continue
with step 2; if the new strategy $\sigma$ was assumed already before,
proceed with step 4.

4. The winning region $W_0$ (resp. $W_1$) consists of the vertices with
positive (resp. negative) play profile, and the desired winning strategies
of player 0 and 1 are extracted from the computed strategies $\sigma$ and
$\tau$ on the sets $W_0$ and $W_1$.



Fig. 1. Strategy Improvement Algorithm



The crucial step in this algorithm is the computation of the “optimal response
strategy” of player 1 in step 2. For these details we have to refer the reader to the
paper [?]. In our implementation, written in C, some algorithmic improvements
over [?] are realized, which significantly affect the running time. For the sake of
completeness we list them here (but skip the algorithmic justification):

1. Only play profiles of vertices $v$ are computed where the associated most
relevant loop vertex $r$ is negative.

2. A strategy $\sigma$ is not changed at vertex $v$ when the associated most relevant
loop vertex is positive.

3. The second component of the play profile is represented in tree form, which
gives a linear memory bound for these components over all vertices.

In the sequel, we concentrate on the second component of the implementation,
a user interface written in Common Lisp. This user interface allows the user to
name vertices by arbitrary symbols or tuples of symbols. Using the interactive
environment of Lisp, one can easily generate scalable game graphs, as needed
for parametrized case studies.



3 User Interface 

3.1 Input Language 

The input language is embedded in Common Lisp, extended by some functions 
and macros that allow a convenient description of parity game graphs. 

The synthesis algorithm can be called from within a Common Lisp environment
by

    (strategy vertices-of-player0 vertices-of-player1 edge-list colors order)

This function call starts with its name strategy. The parameters vertices-of-player0
and vertices-of-player1 are the lists of vertices associated with player 0
resp. 1. The parameter edge-list is a list of edges; each edge is itself a list of the
form (source target) where source and target are vertices. The parameter colors
is a list (color0 color1 ... colorn) where colori is the list of vertices of color i,
usually in the order of increasing relevance.

The parameter order can be omitted; in this case (which corresponds to the
option :last-best) the colors are considered to be ordered by increasing relevance.
If the parameter order is set to be :first-best, then the colors are considered to
be listed in decreasing relevance, as well as the vertices in each list colori.

A vertex can be any Lisp object. The equality of two vertices is determined 
by the function equal. 



Example Consider the parity game graph $G_2 = (V_0, V_1, E, c)$ with $V_0 = \{2, 4\}$,
$V_1 = \{1, 3\}$, $E = \{(1,2), (2,1), (2,3), (3,4), (4,3)\}$ and $c(1) = 1$, $c(2) = 1$,
$c(3) = 1$, $c(4) = 0$ as presented in Figure 2. We indicate vertices in $V_0$ by circles
and vertices in $V_1$ by boxes.




Fig. 2. Example graph $G_2$.

The procedure for computing the winning regions and corresponding winning
strategies is called as follows:¹

    (strategy '(2 4)
              '(1 3)
              '((1 2) (2 1) (2 3) (3 4) (4 3))
              '((4) (3 2 1))
              :first-best)

Instead of listing the elements for the parameters of such a function call ex- 
plicitly, we shall use functions that generate the parameters for a call of strategy. 



Macrolanguage For larger examples, an algorithmic definition of game graphs
is appropriate. We develop a notation for such definitions, again in the form of
a Lisp macro, now with the name parity-game. It provides an environment in
which the three macros vertex, edge and dedge are available. These introduce the
vertices, the edges, and for convenience also double edges (“dedges”) which are
considered present in both directions.



Example We shall explain its use for the example graphs $G_n$ given in Figure 3.




Fig. 3. Example graph $G_n$.



A specification in Pascal-like pseudo code is given below as well as the
corresponding Lisp code.



    for v := 1 to n * 2 do
        v is vertex with:
            if even(v) then v in V0
                       else v in V1
            if v = 2 * n then color(v) = 0
                         else color(v) = 1
    for i := 1 to n do
        (2 * i - 1, 2 * i) is d-edge
    for i := 1 to n - 1 do
        (2 * i, 2 * i + 1) is edge



    (parity-game
      (for (v 1 (* n 2))
        (vertex (v)
                (if (evenp v) 0 1)
                (if (= v (* 2 n)) 0 1)))
      (for (i 1 n)
        (dedge ((- (* 2 i) 1)) ((* 2 i))))
      (for (i 1 (- n 1))
        (edge ((* 2 i)) ((+ (* 2 i) 1)))))



The introduction of the edges is self-explanatory. For the vertices, one has to 
declare, besides the name, the player to which it belongs, and its color (in the 
example the even vertices belong to player 0, the other to player 1, and the color 
is set 0 for vertex 2n, otherwise 1). In this environment, a vertex is always a list. 
The conditionals are written as usual, with the if-clause and the two values for 
the yes- resp. no-case. We introduced also a for-loop with the obvious meaning. 
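
To make the meaning of the specification concrete, the same graph family can be generated outside Lisp. The Python sketch below is only an illustration of what the parity-game definition of $G_n$ describes; the function name `game_graph` and the returned data layout are hypothetical and not part of the described system.

```python
def game_graph(n):
    """Vertex sets, edge list and coloring of the example graph G_n."""
    vertices = range(1, 2 * n + 1)
    v0 = [v for v in vertices if v % 2 == 0]    # even vertices belong to player 0
    v1 = [v for v in vertices if v % 2 == 1]    # odd vertices belong to player 1
    color = {v: 0 if v == 2 * n else 1 for v in vertices}
    edges = []
    for i in range(1, n + 1):                   # double edges 2i-1 <-> 2i
        edges += [(2 * i - 1, 2 * i), (2 * i, 2 * i - 1)]
    for i in range(1, n):                       # single edges 2i -> 2i+1
        edges.append((2 * i, 2 * i + 1))
    return v0, v1, sorted(edges), color

print(game_graph(2))
```

For n = 2 this yields the vertex sets, edges and colors of the graph $G_2$ used in the earlier call of strategy.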



¹ The quote symbol in front of a parameter tells the interpreter that it is constant;
otherwise the following list would also be interpreted as a function call.






For the examples discussed below we also use definitions of the same format 
(not given in this extended abstract). 

3.2 Output 

The output contains the following data: the computed winning regions of players 
0 and 1, optimal strategies for the two players, and the number of iterations 
used (improvement steps). The latter is the critical parameter for separating 
polynomial from exponential behaviour of the algorithm. 

The listed strategies are winning on the respective winning region. On the 
complement the computed strategy is of course not winning, but just as good as 
possible with respect to the reward order (in the sense of section 2). 



Example In the example considered above, one sees the following output (which
says that player 0 wins from each vertex by choosing the listed edges from vertices
in $V_0$):

    (4 3 2 1)        ; winning region for 0
    NIL              ; winning region for 1
    ((2 3) (4 3))    ; winning, resp. optimal strategy for 0
    ((1 2) (3 4))    ; winning, resp. optimal strategy for 1
    2                ; number of iterations

For a more detailed analysis of the algorithm we found the following parameters 
to be significant (and accessible by separate calls of corresponding functions): 
The overall running time is measured in the number of edge traversals. By 
this we mean the number of all edges traversed for computing the valuations. The 
number of executions of each other basic operation is bounded by the number 
of edge traversals. That means runtime is approximately linear in the number 
of edge traversals; and the number of edge traversals is a system independent 
performance measure. 

The complicated feature of the algorithm is the possibility of changing the 
strategy for each vertex of player 0 in each step. An interesting overview of 
the execution is thus obtained by a matrix in which each row describes the 
result of a given iteration for the vertices of player 0. We call this matrix the 
improvement trace. Each vertex is associated to one column, in the order of 
increasing relevance. The i-th row records the result for the i-th iteration. A
mark at position (i, j) of the matrix indicates that the strategy for vertex j (in
the given order) is changed in iteration i. We distinguish two kinds of marks, 
star and dot. By this we record whether in the corresponding improvement step 
the most relevant loop vertex (see section 2 above) of vertex j is changed during 
iteration i or not. If this happens then the mark is taken to be a star, otherwise 
a dot. 
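
The layout of such a trace can be illustrated with a few lines of Python. This is a sketch of the matrix format only, with a hypothetical input representation (one dictionary per iteration); it is not the output routine of the described program.

```python
def improvement_trace(columns, changes):
    """Rows of the improvement trace as strings.

    columns: player-0 vertices in the order of increasing relevance;
    changes: one dict per iteration, mapping each vertex whose strategy changed
             to True if its most relevant loop vertex changed too (star),
             and to False otherwise (dot).
    """
    rows = []
    for i, changed in enumerate(changes, start=1):
        marks = "".join("*" if changed.get(v) else "." if v in changed else " "
                        for v in columns)
        rows.append(f"{i:>3}|{marks}")
    return rows

# hypothetical run with three player-0 vertices and two iterations
for row in improvement_trace([0, 2, 4], [{0: True, 4: False}, {2: False}]):
    print(row)
```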

An important observation in this connection is the following: If n is the
number of vertices, there may be only [?] iterations containing a star.
Hence the crucial point in deciding whether the algorithm is polynomial time is
to bound the number of iterations without star entries. (In the present extended
abstract we have to skip the examples which illustrate this.)

4 Remarks on Case Studies 

The first case studies carried out with the system described above were concerned
with example games discussed in previous literature on the subject, where
different approaches to the problem of solving parity games were tried. Two such
papers are [?], based on a recursive algorithm of McNaughton [?], and [?], where
progress measures were applied. In both cases, suitable families of game graphs
can be defined which prove exponential worst case behaviour for each algorithm.
The analysis of the present algorithm with our implementation and user
interface shows that here a linear number of iterations suffices to reach termination
(which shows its polynomial time behaviour in these special cases).

In this extended abstract, we deal only with the case study of [?].

Let us describe the additional output function of the implementation for the
version of the graphs of [?] with 136 vertices, where the vertices of player 0 are
numbered 0, 2, ..., 134. The numbers of vertices of player 0 are noted as names
of vertices in the computed improvement trace (in decimal notation top-down).



[Figure 4: improvement trace matrix. Columns: the player-0 vertices 0, 2, ..., 134, labelled top-down in decimal notation. Rows: iterations 1 to 11. Entries: star and dot marks.]

Fig. 4. Example of Improvement Trace



The output (see Figure 4) gives information about the iterations 1 to 11.
From the improvement trace one infers that the algorithm adjusts several choices
simultaneously until the 6th iteration and then proceeds by adjusting the
strategy at the single vertices 16, 12, 8, 4, 0 in the remaining iterations. In this case,
there are 12 iterations, the last one offering no change in strategy (which causes
termination) and hence being suppressed in the trace.

5 Conclusion

We have provided an implementation of the algorithm of [?], extended it by a user
interface, and applied it in several case studies. In ongoing work, more and larger
examples are analyzed, and the program is connected to other procedures of the
system OMEGA. Moreover, the algorithm is being applied to the model-checking
problem of the modal $\mu$-calculus, with a suitably extended user interface.


Abstract. We find bounds for the state complexity of the intersection of
regular languages over an alphabet of one letter. There is an interesting
connection to Jacobsthal’s function from number theory.



1 Introduction 

The state complexity of a regular language $L$, written $\mathrm{sc}(L)$, is the number of
states in the smallest deterministic finite automaton (DFA) accepting $L$. Several
papers, such as [?], address the question of obtaining good upper bounds on
the state complexity of basic operations such as $LL'$, $L^*$, etc., in terms of the
state complexity of $L$ and $L'$. Additional questions of interest arise when one
restricts $L$, $L'$ to be, for example, finite languages.

The standard product construction for automata (e.g., [?, pp. 59-60]) easily
shows that if $\mathrm{sc}(L) = n$, $\mathrm{sc}(L') = n'$, then $\mathrm{sc}(L \cap L') \le nn'$. This upper bound
of $nn'$ can actually be attained for all $n, n' \ge 1$ provided the underlying alphabet
has at least two letters. Indeed, as Yu and Zhuang observe [?], we can let

$$L = \{x \in (a + b)^* : |x|_a = n\}$$

and

$$L' = \{x \in (a + b)^* : |x|_b = n'\},$$

where $|x|_c$ denotes the number of occurrences of the symbol $c$ in the string $x$. A
similar construction works for unary alphabets provided $\gcd(n, n') = 1$. However,
determining the best upper bound for unary languages when $\gcd(n, n') > 1$ was
stated as an open problem by Yu [?].

In this paper we prove bounds for the state complexity of the intersection of
unary regular languages, and we show the problem is related to an interesting
function from number theory due to Jacobsthal. Since $L \cup L' = \overline{\overline{L} \cap \overline{L'}}$, and
$\mathrm{sc}(L) = \mathrm{sc}(\overline{L})$, our results apply equally well to the state complexity of the union
of unary regular languages.

C. Nicaud [?] has recently investigated the average state complexity for
various operations on unary languages, including intersection.

* Research supported in part by a grant from NSERC.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 272-[?], 2001.

2 Unary Deterministic Finite Automata

Let $\Sigma = \{a\}$ and let $M = (Q, \Sigma, \delta, q_0, F)$ be a deterministic finite automaton
with $n$ states. Then by the pigeonhole principle, the transition diagram of $M$ has
a “tail” consisting of $t \ge 0$ states and a “cycle” of $c \ge 1$ states. Furthermore, if
the transition diagram is connected (as we may assume without loss of generality)
then $n = t + c$. See Figure 1.




Fig. 1. Transition diagram of M (accepting states not identified) 



We define $T(M) = t$ and $C(M) = c$. It follows that there exist sets
$A \subseteq \{\epsilon, a, \ldots, a^{t-1}\}$ and $B \subseteq \{a^t, \ldots, a^{t+c-1}\}$ such that

\[ L(M) = A + B(a^c)^* \qquad (1) \]



Theorem 1. Let $M$ and $M'$ be two unary deterministic finite automata. Suppose
the transition diagram of $M$ (resp. $M'$) has a tail of size $t$ and a cycle
of size $c$ (resp. $t'$, $c'$). Then the state complexity of $L(M) \cap L(M')$ is
$\le \max(t, t') + \operatorname{lcm}(c, c')$.

Proof. Write $L(M) = A + B(a^c)^*$, as in Eq. (1), and $L(M') = A' + B'(a^{c'})^*$. We
have $a^s \in L(M)$ iff [$s < t$ implies $a^s \in A$, and $s \ge t$ implies there exists $a^y \in B$
such that $s \equiv y \pmod{c}$]. Similarly, $a^s \in L(M')$ iff [$s < t'$ implies $a^s \in A'$, and
$s \ge t'$ implies there exists $a^z \in B'$ such that $s \equiv z \pmod{c'}$].

Then, by the Chinese remainder theorem, there exists a set $B''$ such that
if $s \ge \max(t, t')$, then $a^s \in L(M) \cap L(M')$ iff there exists $u \in B''$ such that
$s \equiv u \pmod{\operatorname{lcm}(c, c')}$. Hence we can accept $L(M) \cap L(M')$ using a cycle of
size $\operatorname{lcm}(c, c')$ and a tail of size $\max(t, t')$. The upper bound follows. ■

We now show that the upper bound in Theorem 1 is best possible.

Theorem 2. For all $t, t' \ge 0$ and $c, c' \ge 1$ there exist two deterministic finite
automata $M$, $M'$ with $T(M) = t$, $C(M) = c$, $T(M') = t'$, $C(M') = c'$ such that
$sc(L(M) \cap L(M')) = \max(t, t') + \operatorname{lcm}(c, c')$.

Proof. Let $l = \operatorname{lcm}(c, c')$.

If $t = t' = 0$, then let $L = (a^c)^*$ and $L' = (a^{c'})^*$. Then $L$ (respectively $L'$)
may be accepted by a DFA $M$ (respectively $M'$) with $T(M) = 0$ and $C(M) = c$
(respectively $T(M') = 0$ and $C(M') = c'$). Now $L \cap L' = (a^l)^*$. It is now easy to
see that $(a^l)^*$ may be accepted by a DFA with $l$ states, and by the Myhill-Nerode
theorem (e.g., [?, §3.4]), this is best possible.

Otherwise, at least one of $t, t'$ is non-zero. Without loss of generality, assume
$t \ge t'$ and hence $t > 0$. Define

\[ L = a^{t+c-1}(a^c)^*, \qquad L' = a^r (a^{c'})^*, \]

where $r := (t - 1) \bmod c'$. It is easy to see that $L$ (respectively, $L'$) can be
accepted by a DFA $M$ (respectively, $M'$) with $T(M) = t$, $C(M) = c$ (respectively,
$T(M') = t'$, $C(M') = c'$). (In fact, $L'$ can be accepted by a DFA $M'$ with
$T(M') = 0$.)

We claim $L \cap L' = a^{t+l-1}(a^l)^*$. To see this, note that $a^s \in L$ iff
$s = (t + c - 1) + kc$ for some integer $k \ge 0$. Similarly, letting $t - 1 = qc' + r$ with
$0 \le r < c'$, we have $a^s \in L'$ iff $s = r + jc'$ for some integer $j \ge 0$, i.e., iff
$s = (t - 1) + (j - q)c'$. Thus $a^s \in L \cap L'$ iff $t + c - 1 + kc = (t - 1) + (j - q)c'$,
which is the case iff $(k + 1)c = (j - q)c'$. But this equation has integer solutions
iff $k + 1 = bc'/g$ and $j - q = bc/g$ for some integer $b$, where $g = \gcd(c, c')$. But
$k \ge 0$ iff $b \ge 1$. It now follows that

\[ L \cap L' = \{ a^{(t+c-1)+(bc'/g-1)c} : b \ge 1 \} = \{ a^{t-1+bl} : b \ge 1 \} = a^{t+l-1}(a^l)^* \]

as desired.

Now an easy application of the Myhill-Nerode theorem proves that

\[ sc(a^{t+l-1}(a^l)^*) = t + l. \]
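Theorems 1 and 2 can be checked numerically on small cases. The following Python sketch is not from the paper. It assumes a unary DFA is encoded as a hypothetical triple `(t, c, acc)` (tail length, cycle length, set of accepting state indices), and it computes state complexity directly from the eventually periodic membership sequence instead of running a general minimization algorithm.

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def accepts(dfa, s):
    """Does the unary DFA (t, c, acc) accept a^s?  States are numbered
    0..t-1 on the tail and t..t+c-1 on the cycle (an assumed encoding)."""
    t, c, acc = dfa
    state = s if s < t else t + (s - t) % c
    return state in acc

def state_complexity(chi, T, C):
    """Size of the minimal DFA for {a^s : chi(s)}, given that chi is
    periodic with period C from index T onward."""
    # The least period divides every period, so only divisors of C matter.
    period = next(p for p in range(1, C + 1)
                  if C % p == 0
                  and all(chi(s) == chi(s + p) for s in range(T, T + C)))
    # Shrink the tail while periodicity still holds one step earlier.
    tail = T
    while tail > 0 and chi(tail - 1) == chi(tail - 1 + period):
        tail -= 1
    return tail + period

def intersection_sc(m1, m2):
    """State complexity of L(m1) ∩ L(m2), using the tail and cycle of Theorem 1."""
    T, C = max(m1[0], m2[0]), lcm(m1[1], m2[1])
    return state_complexity(lambda s: accepts(m1, s) and accepts(m2, s), T, C)

def witnesses(t, c, t2, c2):
    """The two automata from the proof of Theorem 2 (requires t >= t2)."""
    if t == 0:                                   # then t2 == 0 as well
        return (0, c, {0}), (0, c2, {0})
    r = (t - 1) % c2
    m1 = (t, c, {t + c - 1})                     # a^(t+c-1) (a^c)*
    m2 = (t2, c2, {i for i in range(t2 + c2) if i % c2 == r})  # a^r (a^c2)*
    return m1, m2
```

For example, `intersection_sc(*witnesses(3, 2, 2, 3))` evaluates the witness pair with tails 3 and 2 and cycles 2 and 3, which Theorem 2 predicts to need 3 + lcm(2, 3) states.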



After this paper was completed, I learned that Theorems 1 and 2 were obtained
independently by G. Pighizzini [?].

Together, Theorems 1 and 2 imply that to understand the state complexity
of the intersection of unary regular languages, we need to estimate the function

\[ F(n, n') = \max_{\substack{1 \le c \le n \\ 1 \le c' \le n'}} \bigl( \max(n - c, n' - c') + \operatorname{lcm}(c, c') \bigr). \]

This in turn suggests studying the somewhat simpler and more natural function

\[ G(n, n') = \max_{\substack{1 \le c \le n \\ 1 \le c' \le n'}} \operatorname{lcm}(c, c'). \]

To the best of our knowledge, neither F nor G has been studied previously, 
although we will see in the next section that both functions are closely related 
to the Jacobsthal function. 

3 Jacobsthal’s Function 

Jacobsthal’s function $g(n)$ is defined to be the least integer $r$ such that every
set of $r$ consecutive integers contains at least one integer relatively prime to $n$
[?]. Below we show an interesting connection between this problem and state
complexity for intersection of unary languages.
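The definition can be evaluated by brute force for small $n$. The sketch below is an illustration only, not part of the paper; the function names `jacobsthal` and `omega` are chosen here. It uses the fact that coprimality to $n$ is periodic modulo $n$, so the longest run of integers sharing a factor with $n$ already appears among $2, \ldots, n$.

```python
from math import gcd

def jacobsthal(n):
    """Least r such that any r consecutive integers contain one coprime to n."""
    # 1 and n + 1 are coprime to n, so every maximal run of integers
    # that are not coprime to n has a translate inside 2..n.
    longest = run = 0
    for x in range(2, n + 1):
        if gcd(x, n) > 1:
            run += 1
            longest = max(longest, run)
        else:
            run = 0
    return longest + 1

def omega(n):
    """Number of distinct prime factors of n, by trial division."""
    count, p = 0, 2
    while p * p <= n:
        if n % p == 0:
            count += 1
            while n % p == 0:
                n //= p
        p += 1
    return count + (1 if n > 1 else 0)
```

With these two helpers, a bound of the form $g(n) \le 2^{\omega(n)}$ can be spot-checked over a range of $n$ by comparing `jacobsthal(n)` with `2 ** omega(n)`.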

First, however, we state two known upper bounds on this function. The first
is an explicit bound due to Kanold [?]:

Theorem 3. Let $\omega(n)$ denote the number of distinct prime factors of the positive
integer $n$. Then $g(n) \le 2^{\omega(n)}$ for all integers $n \ge 1$.

The second bound is due to Iwaniec [?]:

Theorem 4. There exists a constant $c_1$ such that $g(n) \le c_1(\log n)^2$ for all
integers $n > 1$.

First we obtain a lower bound on $G$ (and hence $F$):

Theorem 5. Let $n \le n'$. Then there exists a constant $c_1$ such that
$F(n, n') \ge G(n, n') \ge nn' - c_1(\log n)^2 n$.

Proof. By Iwaniec’s theorem, there exists $k$ with $0 \le k < c_1(\log n)^2$ such that
$\gcd(n, n' - k) = 1$. Hence $G(n, n') \ge n(n' - k) \ge n(n' - c_1(\log n)^2)$. ■

Carl Pomerance has kindly pointed out to me (personal communication) that
the lower bound of Theorem 5 can be improved in the case where $n$ and $n'$ do not
differ much in size, as follows. We use a result of Adhikari and Balasubramanian [?]:

Theorem 6. If $x, y$ are positive integers $\le N$, then there exist integers $a, b$ with
$a = O(\log\log\log N)$ and $b = O((\log N)/(\log\log N))$ such that $\gcd(x - a, y - b) = 1$.

Using this theorem, we obtain the following bound on the function
$G(n, n') = \max_{1 \le c \le n,\ 1 \le c' \le n'} \operatorname{lcm}(c, c')$:

Theorem 7. If [...] $(\log\log n)(\log\log\log n)$ [...], then there exists a constant $c_3$ such
that $F(n, n') \ge G(n, n') \ge nn' - c_3(\log\log\log n)n'$.

Next, we find an upper bound on $F$. First we prove the following lemma.

Lemma 1. Let $n, n'$ be fixed positive integers. The quantity

\[ Q(c, c') := \max(n - c, n' - c') + \operatorname{lcm}(c, c') \]

is maximized ($1 \le c \le n$, $1 \le c' \le n'$) only if $\gcd(c, c') = 1$.

Proof. Assume not. Then $Q(c, c')$ is maximized for some $c, c'$ with
$\gcd(c, c') = g > 1$. Assume without loss of generality that $n \ge n'$. For $n \le 11$ the lemma
can be verified by a simple computer program. Hence assume $n > 11$.

We have $\max(n - c, n' - c') < n$ and $\operatorname{lcm}(c, c') = cc'/g \le n^2/2$, so $Q(c, c') < n + n^2/2$.
By Theorem 3 we know there exists a $k$, $1 \le k \le 2^{\omega(n)}$, such that $\gcd(n, n - k) = 1$.
Since $Q(c, c')$ is a maximum, we have $Q(c, c') \ge n(n - k) + k > n(n - 2^{\omega(n)})$.
Putting the inequalities for $Q$ together, we get

\[ n^2 - n 2^{\omega(n)} < n + \frac{n^2}{2}, \]

and so $n - 2^{\omega(n)} < 1 + \frac{n}{2}$. Thus $n < 2(2^{\omega(n)} + 1)$.

However, we claim that $n > 2(2^{\omega(n)} + 1)$ for $n > 11$. For $11 < n \le 141$ this
follows by an explicit calculation. Otherwise $n \ge 142$. We now use a theorem of
Robin [?] which states $\omega(n) \le t(n)$, where

\[ t(n) := \frac{\log n}{\log\log n} + 1.45743\,\frac{\log n}{(\log\log n)^2}. \]

Since $n \ge 142$, we have $\log\log n > 1.6$ and so

\[ t(n) < \frac{\log_2 n}{1.6 \log_2 e} + \frac{1.45743 \log_2 n}{2.56 \log_2 e} < .83 \log_2 n. \]

We thus obtain

\[ 2(2^{\omega(n)} + 1) \le 2(2^{t(n)} + 1) < 2(n^{.83} + 1), \]

and it is easily verified that $2(n^{.83} + 1) < n$ for $n \ge 70$. This contradiction
completes the proof. ■



Remark. Ming-wei Wang points out (personal communication) that a slightly
weaker result is much easier to prove: namely, that $Q$ achieves its maximum at
some $(c, c')$ with $\gcd(c, c') = 1$ (as opposed to “only if”). For if $\gcd(c, c') > 1$,
then write $c = 2^{e_1} 3^{e_2} \cdots p_k^{e_k}$ and $c' = 2^{f_1} 3^{f_2} \cdots p_k^{f_k}$, where $p_i$ is the $i$'th prime
and $p_k$ is the largest prime dividing either $c$ or $c'$. Let

\[ d = \prod_{\substack{1 \le i \le k \\ e_i < f_i}} p_i^{e_i} \qquad \text{and} \qquad d' = \prod_{\substack{1 \le i \le k \\ e_i \ge f_i}} p_i^{f_i}. \]

Then $\operatorname{lcm}(c/d, c'/d') = \operatorname{lcm}(c, c')$, and hence we have $Q(c/d, c'/d') \ge Q(c, c')$.
However $\gcd(c/d, c'/d') = 1$.

Remark. Note that $F(n, n') = \max_{1 \le c \le n,\ 1 \le c' \le n'} (\max(n - c, n' - c') + \operatorname{lcm}(c, c'))$ does
not necessarily achieve its maximum at the same pair $(c, c')$ which maximizes
$G(n, n') = \max_{1 \le c \le n,\ 1 \le c' \le n'} \operatorname{lcm}(c, c')$. For example, $F(148, 30) = 4295$, which is
uniquely achieved at $(c, c') = (143, 30)$, while $G(148, 30) = 4292$, which is
uniquely achieved at $(c, c') = (148, 29)$.
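Both functions are small enough to evaluate exhaustively for modest arguments, which makes the example above easy to reproduce. The sketch below is an illustration rather than the author's program; the helper `maximizers` and its return format are assumptions made here.

```python
from math import gcd

def lcm(a, b):
    return a * b // gcd(a, b)

def Q(n, n2, c, c2):
    """Tail plus cycle size for the pair of cycle lengths (c, c2)."""
    return max(n - c, n2 - c2) + lcm(c, c2)

def maximizers(score, n, n2):
    """Return (best value, list of pairs (c, c2) attaining it)."""
    pairs = [(c, c2) for c in range(1, n + 1) for c2 in range(1, n2 + 1)]
    best = max(score(c, c2) for c, c2 in pairs)
    return best, [p for p in pairs if score(*p) == best]

def F(n, n2):
    return maximizers(lambda c, c2: Q(n, n2, c, c2), n, n2)

def G(n, n2):
    return maximizers(lcm, n, n2)
```

Calling `F(148, 30)` and `G(148, 30)` returns each maximum together with every pair attaining it, so uniqueness of the maximizing pair can be read off from the length of the returned list.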

We can now prove our upper bound. 

Theorem 8. There exist a constant $c_4$ and infinitely many distinct pairs $n, n'$
with $n' < n$ such that $G(n, n') \le F(n, n') \le nn' - c_4 n \sqrt{\log n / \log\log n}$.

Proof. Let $d \ge 1$ be a fixed integer. Let $S_d = \{(i, j) : i, j \ge 0 \text{ and } i + j < d\}$. For
each pair $(i, j) \in S_d$, choose a distinct prime $q_{i,j}$ from the set $\{p_1, p_2, \ldots, p_v\}$,
where $p_i$ denotes the $i$'th prime and $v = d(d + 1)/2$. By the Chinese remainder
theorem, we can find $n, n'$ such that $q_{i,j} \mid n - i$ and $q_{i,j} \mid n' - j$ for all pairs
$(i, j) \in S_d$. Furthermore, we may choose $n$ and $n'$ such that $K \le n' < 2K$,
$2K \le n < 3K$, where $K = p_1 p_2 \cdots p_v$. By the prime number theorem (e.g.,
[?]), we have $K = e^{(1+o(1)) v \log v}$. Hence there exists a constant $c_5$ such that
$d \ge c_5 \sqrt{\log n / \log\log n}$.

It follows that $\gcd(n - i, n' - j) > 1$ for all pairs $(i, j) \in S_d$. By Lemma 1
we know that $F$ cannot achieve its maximum when $(n - c, n' - c') \in S_d$.
It follows that $F(n, n') \le \max_{b+c=d} ((n - b)(n' - c) + d)$. But

\[ \max_{b+c=d} ((n - b)(n' - c) + d) \le nn' - dn' + d^2/4 + d. \]

Hence $F(n, n') \le n'(n - d) + d^2/4 + d$. Since $n' \ge n/3$, the desired result follows. ■



Remark. This result suggests defining a function $S(n)$ to be the least positive
integer $r$ such that there exists an integer $m$, $0 \le m < r$, with $\gcd(r - i, m - j) > 1$
for $0 \le i, j < n$. By an argument similar to that given above, we know that
$S(n) \le$ [...]. The following table gives the first few values of $S(n)$:

n    S(n)    m
1    2       0
2    21      15
3    1310    1276
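The first rows of the table can be reproduced by direct search. The sketch below is a straightforward reading of the definition, not the program used for the paper; the names `works` and `S` are chosen here, and the search simply tries $r = 1, 2, \ldots$ and, for each $r$, every $m$ with $0 \le m < r$.

```python
from math import gcd

def works(r, m, n):
    """True if gcd(r - i, m - j) > 1 for all 0 <= i, j < n."""
    return all(gcd(r - i, m - j) > 1 for i in range(n) for j in range(n))

def S(n):
    """Least r >= 1 admitting some m in [0, r); returns the pair (r, m)."""
    r = 1
    while True:
        for m in range(r):
            if works(r, m, n):
                return r, m
        r += 1
```

The search for `S(3)` examines on the order of a million pairs, and the cost grows quickly with $n$, so larger values need a sieve or a search guided by the Chinese remainder theorem rather than this loop.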



It is possible to prove through brute force calculation that
$450000 < S(4) \le 172379781$. The upper bound follows from the fact that if

\[ (x, y) = (172379781, 153132345), \]

then we have $\gcd(x - i, y - j) > 1$ for $0 \le i, j < 4$.

Abstract. We have implemented a package that transforms concise
algebraic descriptions of linear block codes into finite automata
representations, and also generates decoders from such representations. The
transformation takes a description of the code in the form of a $k \times n$
generator matrix over a field with $q$ elements, representing a finite
language containing $q^k$ strings, and constructs a minimal automaton
for the language from it, employing a well known algorithm. Next,
from a decomposition of the minimal automaton into subautomata, it 
generates an overlayed automaton, and an efficient decoder for the code 
using a new algorithm. A simulator for the decoder on an additive white 
Gaussian noise channel is also generated. This simulator can be used 
to run test cases for specific codes for which an overlayed automaton 
is available. Experiments on the well known Golay code indicate that 
the new decoding algorithm is considerably more efficient than the 
traditional Viterbi algorithm run on the original automaton. 

Keywords: block codes, minimal trellis, decoder complexity, subtrellis 
overlaying. 



1 Introduction 

The theory of finite state automata has many interesting connections to the field
of error correcting codes [13]. After the early work on trellis representations of
block codes [1,22,15,18,7], there has recently been a spate of research on minimal
trellis representations of block error correcting codes [8,12,17]. Trellis descriptions
are combinatorial descriptions, as opposed to the traditional algebraic descriptions
of block codes. A minimal trellis for a linear block code is just the transition
graph for the minimal finite state automaton which accepts the language consisting
of the set of all codewords. With such a description, the decoding problem
reduces to finding a cheapest accepting path in such an automaton (where transitions
are assigned costs based on a channel model). However, trellises for many
useful block codes are often too large to be of practical value. Of immense interest,
therefore, are tailbiting trellises for block codes, recently introduced in [3],
which have reduced state complexity. The strings accepted by a finite state machine
represented by a trellis are all of the same length, that is, the block length
of the code. Coding theorists therefore attach to all states that can be reached
by strings of the same length $l$ a time index $l$. Conventional trellises use a linear
time index, whereas tailbiting trellises use a circular time index. It has been
observed [21] that the maximum state complexity of a tailbiting trellis at any time
index can drop to the square root of the maximum state complexity of a conventional
trellis for the code, thus increasing the potential practical applications of
trellis representations for block codes. It is shown in [6] that tailbiting trellises
can be viewed as overlayed automata, i.e., a superimposition of several identically
structured finite state automata, so that states at certain time indices are shared
by two or more automata. This view leads to a new decoding algorithm that is
seen to be more efficient than the traditional Viterbi algorithm. Some preliminary
details [10] are available for the construction of minimal tailbiting trellises
from conventional trellises. The full theory is currently being worked out [11].

In this paper, we describe a software package that provides the following 
facilities. 

1. It constructs a minimal finite state automaton (conventional trellis) for a
linear block code from a concise algebraic description in terms of a generator
matrix using the algorithm of Kschischang-Sorokine [12]. This involves two
steps: first, the conversion of the generator matrix into trellis oriented form,
a sequence of operations similar to Gaussian elimination; second, the
generation of the minimal trellis as the product trellis of smaller trellises. Note
that the algebraic description is concise, consisting of a $k \times n$ matrix with
entries from a field with $q$ elements. The language consists of $q^k$ strings.

2. Given a set of identically structured automata (subtrellises) that can be
superimposed to produce an overlayed automaton (tailbiting trellis), it
produces a decoder for the block code using a new algorithm described in [6].

3. Given the parameters of an additive white Gaussian noise (AWGN) channel,
it simulates the decoder on the overlayed automaton and outputs the
decoded vector and other statistics for the range of signal to noise ratios (SNR)
requested by the user.

Simulations on the hexacode [5], and on the practically important Golay code,
the tailbiting trellises of which are both available [3], indicate that there is a 
significant gain in decoding rate using the new algorithm on the tailbiting trellis 
over the Viterbi algorithm on the conventional trellis. 

We hope to augment the package with a module to convert from a minimal 
conventional trellis to a minimal tailbiting trellis when the technique (stated to 
be polynomial time in [10]) becomes available. 

Section 2 describes conventional trellises for block codes and the Kschischang-Sorokine
algorithm used to build the minimal trellis; Section 3 defines tailbiting
trellises and overlayed automata; Section 4 describes the decoding algorithm;
Section 5 describes our implementation and presents some results obtained by
running our decoder on a tailbiting trellis for the Golay code. Finally, Section 6
concludes the paper.

2 Minimal Trellises for Block Codes



In block coding, an information sequence of symbols over a finite alphabet is
divided into message blocks of fixed length; each message block consists of $k$
information symbols. If $q$ is the size of the finite alphabet, there are a total of $q^k$
distinct messages. Each message is encoded into a distinct codeword of $n$ ($n > k$)
symbols. There are thus $q^k$ codewords each of length $n$, and this set forms a block
code of length $n$. A block code is typically used to correct errors that occur in
transmission over a communication channel. A subclass of block codes, the linear
block codes, has been used extensively for error correction. Traditionally such
codes have been described algebraically, their algebraic properties playing a key
role in hard decision decoding algorithms. In hard decision algorithms, the signals
received at the output of the channel are quantized into one of the $q$ possible
transmitted values, and decoding is performed on a block of symbols of length
$n$ representing the received codeword, possibly corrupted by some errors. By
contrast, soft decision decoding algorithms do not require quantization before
decoding and are known to provide significant coding gains [4] when compared
with hard decision decoding algorithms. That block codes have efficient combinatorial
descriptions in the form of trellises was discovered in 1974 [1]. Other
early seminal work in the area appears in [22], [15], [7], [18]. For background on
the algebraic theory of block codes, readers are referred to the classic texts [14,
2, 16]; for trellis structure of block codes, [19] is an excellent reference.

Let $F_q$ be the field with $q$ elements. It is customary to define linear codes
algebraically as follows:

Definition 1. A linear block code $C$ of length $n$ over a field $F_q$ is a
$k$-dimensional subspace of an $n$-dimensional vector space over the field $F_q$ (such a
code is called an $(n, k)$ code).

The most common algebraic representation of a linear block code is the generator
matrix $G$. A $k \times n$ matrix $G$ whose rows are linearly independent and
generate the subspace corresponding to $C$ is called a generator matrix
for $C$. Figure 1 shows a generator matrix for a (4,2) linear code over $F_2$.

    0 1 1 0
    1 0 0 1

Fig. 1. Generator matrix for a (4, 2) linear binary code

A general block code also has a combinatorial description in the form of a trellis.
We borrow from Kschischang and Sorokine [12] the definition of a trellis for a
block code.

Definition 2. A trellis for a block code $C$ of length $n$ is an edge labeled directed
graph with a distinguished root vertex $s$ having in-degree 0 and a distinguished
goal vertex $f$ having out-degree 0, with the following properties:

1. All vertices can be reached from the root. 

2. The goal can be reached from all vertices. 

3. The number of edges traversed in passing from the root to the goal along any 
path is n. 

4. The set of n-tuples obtained by “reading off” the edge labels encountered in
traversing all paths from the root to the goal is $C$.

The length of a path (in edges) from the root to any vertex is unique and is
sometimes called the time index of the vertex. One measure of the size of a trellis
is the total number of vertices in the trellis. It is well known that minimal trellises
for linear block codes are unique [18] and constructible from a generator matrix
for the code [12]. Such trellises are known to be biproper. Biproperness is the
terminology used by coding theorists to specify that the finite state automaton
whose transition graph is the trellis is deterministic, and so is the automaton
obtained by reversing all the edges in the trellis. (Formal language theorists
call such languages bideterministic languages.) In contrast, minimal trellises for
non-linear codes are, in general, neither unique nor deterministic [12]. Figure 2
shows a trellis for the linear code in Figure 1.

Fig. 2. A trellis for the linear block code of Figure 1 with $S_0 = s$ and $S_9 = f$



2.1 Constructing Minimal Trellises for Block Codes 

We briefly describe the algorithm given in [12] for constructing a minimal
trellis, implemented in our package. An important component of the algorithm
is the trellis product construction, whereby a trellis for a “sum” code
can be obtained as a product of component trellises. Let $T_1$ and $T_2$ be the
component trellises. We wish to construct the trellis product $T_1 \cdot T_2$. The set
of vertices of the product trellis at each time index is just the Cartesian
product of the vertices of the component trellises. Thus if $i$ is a time index,
$V_i(T_1 \cdot T_2) = V_i(T_1) \times V_i(T_2)$. Consider $E_i(T_1) \times E_i(T_2)$, and interpret an element
$((v_1, \alpha_1, v_1'), (v_2, \alpha_2, v_2'))$ in this product, where $v_i, v_i'$ are vertices and $\alpha_1, \alpha_2$
edge labels, as the edge $((v_1, v_2), \alpha_1 + \alpha_2, (v_1', v_2'))$, where $+$ denotes addition in
the field. If we define the $i$th section as the set of edges connecting the vertices
at time index $i$ to those at time index $i + 1$, then the edge count in the $i$th section
is the product of the edge counts in the $i$th section of the individual trellises.
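The product construction is short enough to try out on the (4, 2) code of Figure 1. The Python sketch below is an illustration under stated assumptions, not the C package described in this paper: it works over $F_2$ only, represents a trellis as a list of sections, each a set of hypothetical `(from_state, label, to_state)` triples, and builds the component trellis of a single codeword with a helper named here `elementary_trellis`.

```python
from itertools import product

def elementary_trellis(g):
    """Sections of a trellis for the binary (n,1) code {0, g}.  Section p
    holds edges (u, label, v) from time index p to p + 1; the state is the
    multiplier of g while the span is open, and 0 otherwise."""
    nz = [p for p, x in enumerate(g) if x]
    a, b = nz[0], nz[-1]                     # 0-based span [a, b]
    def state(i, u):                         # state at time index i
        return u if a < i <= b else 0
    sections = []
    for p in range(len(g)):
        mults = (0, 1) if a <= p <= b else (0,)
        sections.append({(state(p, u), u * g[p], state(p + 1, u)) for u in mults})
    return sections

def trellis_product(t1, t2):
    """Section-wise product: pair up states, add labels in GF(2)."""
    return [{((v1, v2), l1 ^ l2, (w1, w2))
             for (v1, l1, w1), (v2, l2, w2) in product(s1, s2)}
            for s1, s2 in zip(t1, t2)]

def codewords(trellis, root):
    """Read off the labels along every path that starts at the root."""
    paths = {root: {()}}
    for section in trellis:
        nxt = {}
        for v, label, w in section:
            for word in paths.get(v, ()):
                nxt.setdefault(w, set()).add(word + (label,))
        paths = nxt
    return set().union(*paths.values())
```

Taking the product of `elementary_trellis([0, 1, 1, 0])` and `elementary_trellis([1, 0, 0, 1])` and passing it to `codewords` with root `(0, 0)` enumerates the code generated by the two rows, and the number of edges per section is the product of the component edge counts.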

Before the product is constructed we put the matrix in trellis oriented form,
described now. Given a nonzero codeword $C = (c_1, c_2, \ldots, c_n)$, $start(C)$ is the
smallest integer $i$ such that $c_i$ is nonzero. Also $end(C)$ is the largest integer $i$
for which $c_i$ is nonzero. The span of $C$ is $[start(C), end(C)]$. By convention the
span of the all-0 codeword 0 is []. The minimal trellis for the binary $(n, 1)$ code
generated by a nonzero codeword with span $[a, b]$ is constructed as follows. There
is only one path up to $a - 1$ from index 0, and from $b$ to $n$. From $a - 1$ there are 2
outgoing branches diverging (corresponding to the 2 multiples of the codeword),
and from $b - 1$ to $b$, there are 2 branches converging. For a code over $F_q$ there
will be $q$ outgoing branches and $q$ converging branches. It is easy to see that this
is the minimal trellis for the 1-dimensional code.

To generate the minimal trellis for $C$ we first put the generator matrix into trellis
oriented form, where for every pair of rows with spans $[a_1, b_1]$, $[a_2, b_2]$, we have
$a_1 \ne a_2$ and $b_1 \ne b_2$. We then construct individual trellises for the $k$ 1-dimensional codes
as described above, and then form the trellis product. Conversion of a generator
matrix into trellis oriented form requires a sequence of operations similar
to Gaussian elimination, applied twice. In the first phase, we apply the method
to ensure that each row in the matrix starts its first nonzero entry at a later time
index than the previous row. In the second phase we ensure that no
two rows have their last nonzero entry at the same time index. We see that the
generator matrix displayed earlier is already in trellis oriented form. The
complexity of the Kschischang-Sorokine algorithm is $O(k.n + s)$ for an $(n, k)$ linear
code whose minimal trellis has $s$ states.
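A minimal sketch of the conversion to trellis oriented form follows. It is not the package's C code: it assumes a binary generator matrix of full rank, and instead of the two elimination phases described above it uses a simple greedy loop that adds one row to another whenever two rows share a start or an end. The helper `state_profile` relies on the standard fact, assumed here, that the minimal trellis has $q^a$ states at a time index where $a$ rows of the trellis oriented matrix are active.

```python
def span(row):
    """(start, end) of a nonzero row, as 1-based positions."""
    nz = [i for i, x in enumerate(row, start=1) if x]
    return nz[0], nz[-1]

def trellis_oriented_form(G):
    """Greedy reduction over GF(2) until all starts and all ends differ.
    Assumes the rows of G are linearly independent."""
    G = [row[:] for row in G]
    changed = True
    while changed:
        changed = False
        for i in range(len(G)):
            for j in range(len(G)):
                if i == j:
                    continue
                (si, ei), (sj, ej) = span(G[i]), span(G[j])
                # Adding row i to row j strictly shortens the span of row j.
                if (si == sj and ei <= ej) or (ei == ej and si >= sj):
                    G[j] = [x ^ y for x, y in zip(G[i], G[j])]
                    changed = True
    return G

def state_profile(G):
    """Number of states at time indices 0..n for a binary matrix in
    trellis oriented form: a row with span [a, b] is active for a <= i < b."""
    n = len(G[0])
    spans = [span(row) for row in G]
    return [2 ** sum(1 for a, b in spans if a <= i < b) for i in range(n + 1)]
```

On the matrix of Figure 1 the loop makes no change, which agrees with the observation that it is already in trellis oriented form, and `state_profile` reports the number of vertices at each of the five time indices.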

3 Tailbiting Trellises 

We borrow the definition of a tailbiting trellis from [10]. 

Definition 3. A tailbiting trellis $T = (V, E, F_q)$ of depth $n$ is an edge labelled
directed graph with the following property. The vertex set can be partitioned as
$V = V_0 \cup V_1 \cup \cdots \cup V_{n-1}$ such that every edge in $T$ either begins at a
vertex of $V_i$ and ends at a vertex of $V_{i+1}$ for some $i = 0, 1, \ldots, n - 2$, or begins at
a vertex of $V_{n-1}$ and ends at a vertex of $V_0$.

The notion of a minimal tailbiting trellis is more complicated than that of
a conventional trellis. The ordered sequence $(|V_0|, |V_1|, \ldots, |V_{n-1}|)$ is called the
state complexity profile of the tailbiting trellis. For a given linear code $C$, the
state complexity profiles of all tailbiting trellises for $C$ form a partially ordered
set under componentwise comparison. A trellis $T$ is said to be smaller than or equal to
another trellis $T'$, denoted by $T \le_S T'$, if $|V_i| \le |V_i'|$ for all $i = 0, 1, \ldots, n - 1$. It is
smaller if equality does not hold for all $i$ in the expression above. We say that a
tailbiting trellis is minimal under $\le_S$ if a smaller trellis does not exist. For
conventional trellises the minimal trellis is unique. However, for tailbiting trellises,
there are, in general, several nonisomorphic minimal trellises that are incomparable
with one another. Yet another ordering on tailbiting trellises is given by the
product ordering $\le_P$. This is a total ordering: $T \le_P T'$ iff $\prod_{i=0}^{n-1} |V_i| \le \prod_{i=0}^{n-1} |V_i'|$.
It is stated in [10] that $T \le_S T' \Rightarrow T \le_P T'$. An outline for constructing a
minimal tailbiting trellis for a linear block code is given in [10]. The complexity
is stated to be polynomial in the length $n$ of the code. The detailed theory is under
preparation [11]. Figure 4 is a tailbiting trellis for the linear code of Figure 1. Let
$S_{max}(T)$ denote the maximum number of states of trellis $T$ at any time index,
when the index is allowed to range from 0 to $n$.

It is shown in [6] that a tailbiting trellis can be viewed as an overlayed automaton.
This is a somewhat more natural view, we believe, and it also leads to an
efficient decoding algorithm on tailbiting trellises.

3.1 Tailbiting Trellises as Overlayed Automata

An overlayed trellis has been introduced in [6], and we give the definition
below. Let $C$ be a linear code over a finite alphabet. Let $C_0, C_1, \ldots, C_l$ be a
partition of the code $C$, such that $C_0$ is a subgroup of $C$ under the operation
of componentwise addition over the structure that defines the alphabet set of
the code (usually a field or a ring), and $C_1, \ldots, C_l$ are cosets of $C_0$ in $C$. Let
$C_i = C_0 + h_i$ where $h_i$, $1 \le i \le l$, are coset leaders, and let $C_i$ have minimal
trellis $T_i$. The subcode $C_0$ is chosen so that the maximum state complexity is
$N$ (occurring at some time index, say, $m$), where $N$ divides $M$, the maximum
state complexity of the conventional trellis at that time index. The subcodes
$C_0, C_1, \ldots, C_l$ are all disjoint subcodes whose union is $C$. Further, the minimal
trellises for $C_0, C_1, \ldots, C_l$ are all structurally identical and two way proper.
(That they are structurally identical can be verified by relabeling a path labeled
$g_1 g_2 \cdots g_n$ in $C_0$ with $g_1 + h_{i1}, g_2 + h_{i2}, \ldots, g_n + h_{in}$ in the trellis corresponding to
$C_0 + h_i$, where $h_i = h_{i1} h_{i2} \cdots h_{in}$.) We therefore refer to $T_1, T_2, \ldots, T_l$ as copies
of $T_0$.



Definition 4. An overlayed proper trellis is said to exist for $C$ with respect to
the partition $C_0, C_1, \ldots, C_l$, where $C_i$, $0 \le i \le l$, are subcodes as defined above,
corresponding to minimal trellises $T_0, T_1, \ldots, T_l$ respectively, with $S_{max}(T_0) = N$,
iff it is possible to construct a proper trellis $T_v$ satisfying the following properties:

1. The trellis $T_v$ has $l + 1$ start states labeled $[s_0, \emptyset, \emptyset, \ldots, \emptyset]$, $[\emptyset, s_1, \emptyset, \ldots, \emptyset]$, ...,
$[\emptyset, \emptyset, \ldots, \emptyset, s_l]$, where $s_i$ is the start state for subtrellis $T_i$, $0 \le i \le l$.

2. The trellis $T_v$ has $l + 1$ final states labeled $[f_0, \emptyset, \emptyset, \ldots, \emptyset]$, $[\emptyset, f_1, \emptyset, \ldots, \emptyset]$, ...,
$[\emptyset, \emptyset, \ldots, \emptyset, f_l]$, where $f_i$ is the final state for subtrellis $T_i$, $0 \le i \le l$.

3. Each state of $T_v$ has a label of the form $[p_0, p_1, \ldots, p_l]$ where $p_i$ is either $\emptyset$ or
a state of $T_i$, $0 \le i \le l$. Each state of $T_i$ appears in exactly one state of $T_v$.

4. There is a transition on symbol $a$ from state labeled $[p_0, p_1, \ldots, p_l]$ to
$[q_0, q_1, \ldots, q_l]$ in $T_v$ if and only if there is a transition from $p_i$ to $q_i$ on symbol
$a$ in $T_i$, provided neither $p_i$ nor $q_i$ is $\emptyset$, for at least one value of $i$ in the set
$\{0, 1, 2, \ldots, l\}$.

5. The maximum width of the trellis $T_v$ at an arbitrary time index $i$, $1 \le i \le n - 1$,
is at most $N$.

6. The set of paths from $[\emptyset, \emptyset, \ldots, s_j, \ldots, \emptyset]$ to $[\emptyset, \emptyset, \ldots, f_j, \ldots, \emptyset]$ is exactly $C_j$,
$0 \le j \le l$.

Let the state projection of state $[p_0, p_1, \ldots, p_i, \ldots, p_l]$ into subcode index $i$ be
$p_i$ if $p_i \ne \emptyset$ and empty if $p_i = \emptyset$. The subcode projection of $T_v$ into subcode index
$i$ is denoted by the symbol $|T_v|_i$ and consists of the subtrellis of $T_v$ obtained by
retaining all the non-$\emptyset$ states in the state projection of the set of states into
subcode index $i$ and the edges between them. An overlayed trellis satisfies the
property of projection consistency, which stipulates that $|T_v|_i = T_i$. Thus every
subtrellis $T_i$ is embedded in $T_v$ and can be obtained from it by a projection
into the appropriate subcode index. We note here that the conventional trellis
is equivalent to an overlayed trellis with $M/N = 1$.

Figure 3 shows the subtrellises for component codes $C_0$ and $C_1$ of the linear code
defined earlier, overlayed to obtain the tailbiting trellis in Figure 4.

Fig. 3. Minimal trellises for (a) $C_0 = \{0000, 0110\}$ and (b) $C_1 = \{1001, 1111\}$

Fig. 4. Trellis obtained by overlaying the trellises in Figures 3(a) and 3(b)



It is shown in [6] that not all decompositions give overlayed trellises satisfying the
specified bound on the width. Necessary and sufficient conditions for a decomposition
to yield a tailbiting trellis with a specified bounded width are also given
in [6]. For purposes of decoding, we need the decomposition of the original conventional
trellis into subtrellises that can be overlayed to form a tailbiting trellis.
Each subtrellis has been shown to correspond to a coset of a group, and can be
generated from the appropriate coset leader.

4 Decoding 

Decoding refers to the process of forming an estimate of the transmitted 
codeword x from a possibly garbled version y. The received vector consists of a 
sequence of n real numbers where n is the length of the code. The soft decision 
decoding algorithm can be viewed as a shortest path algorithm on the trellis for 
the code. Based on the received vector, a cost $l(u, v)$ can be associated with an
edge from node u to node v. The well known Viterbi decoding algorithm [20] 
is essentially a dynamic programming algorithm, used to compute a shortest 
path from the source to the goal node. Define a winning path as a shortest path 
from one of the start nodes to a final node. For a tailbiting trellis, decoding is 
complicated by the fact that non-accepting paths in the overlayed automaton 
share states with accepting paths. Thus a winning path at a goal node may 
not be an accepting path. We have designed and implemented a two phase 
algorithm which outputs a winning accepting path. The algorithm is outlined in 
[6] and described in detail along with proofs of correctness in [5]. We describe 
it informally here, along with a tiny example. 
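Before turning to the two phase algorithm, it may help to see plain Viterbi decoding as a shortest path computation on the conventional trellis of the (4, 2) code of Figure 1. The Python sketch below is an illustration only. The state names, the BPSK mapping (bit 0 to +1, bit 1 to -1), and the squared Euclidean branch cost are assumptions made here, not details taken from the package.

```python
# Minimal trellis of the (4,2) code of Fig. 1, written out by hand.
# Each section lists edges (from_state, to_state, bit); names are hypothetical.
SECTIONS = [
    [("s", "A0", 0), ("s", "A1", 1)],
    [("A0", "B00", 0), ("A0", "B10", 1), ("A1", "B01", 0), ("A1", "B11", 1)],
    [("B00", "C0", 0), ("B10", "C0", 1), ("B01", "C1", 0), ("B11", "C1", 1)],
    [("C0", "f", 0), ("C1", "f", 1)],
]

def branch_cost(y, bit):
    """Squared distance from y to the assumed BPSK point (0 -> +1, 1 -> -1)."""
    return (y - (1.0 - 2.0 * bit)) ** 2

def viterbi(sections, received, root="s", goal="f"):
    """Dynamic programming over time indices: keep one survivor per state."""
    survivors = {root: (0.0, [])}            # state -> (cost, bits so far)
    for edges, y in zip(sections, received):
        nxt = {}
        for u, v, bit in edges:
            if u in survivors:
                cost = survivors[u][0] + branch_cost(y, bit)
                if v not in nxt or cost < nxt[v][0]:
                    nxt[v] = (cost, survivors[u][1] + [bit])
        survivors = nxt
    cost, bits = survivors[goal]
    return bits, cost
```

For instance, `viterbi(SECTIONS, [0.9, -1.1, -0.8, 1.2])` returns the codeword closest to the received vector together with its accumulated cost. On a tailbiting trellis the same pass can return a path whose start and final states do not match, which is the complication the two phase algorithm addresses.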

During the first phase, a Viterbi algorithm is run on the trellis and survivors,
i.e., shortest paths from any source node to all nodes, are computed. Each
node stores the cost of the survivor at itself at the end of the first phase.
The second phase is an adaptation of the A* algorithm [9], well known in the
artificial intelligence community, and may be viewed as an adaptation of the
Dijkstra algorithm with node to goal estimates added to source to node costs.
All survivors at goal nodes are gathered at the end of the first phase. We term
accepting paths $s_i - f_i$ paths and non-accepting paths $s_i - f_j$ paths,
with $i \ne j$. The second phase only needs to look at subtrellises $T_j$ such that
the winning path in $T_j$ is an $s_i - f_j$ path and such that there are no $s_k - f_k$
paths with smaller cost (we call such trellises residual trellises), as the cost
of a winning $s_i - f_j$ path will be an underestimate of that of the winning
$s_j - f_j$ path. Any $s_i - f_j$ path with estimated cost greater than that of an $s_k - f_k$
path can therefore never be a winner. All residual trellises are candidates for
the second phase. Decoding begins at the best candidate, i.e., the one with the
least estimate. The current estimates of all other residual trellises are stored in
a heap. If at any instant the estimated cost of the current trellis exceeds the
minimum value on the heap, the search in the current trellis is terminated, its
estimate inserted into the heap, and the trellis corresponding to the minimum
value in the heap taken up for searching next. Thus the algorithm makes its
way towards the goal travelling on the best subtrellis seen so far at any given
instant. As soon as the goal node is reached, we are sure that we have the
winning path. For high signal to noise ratios it is observed that the algorithm
either does not need the second phase at all, or that it usually stays on a single
subtrellis for the whole of the second phase. We illustrate with a tiny example.
Though this one has only one residual trellis, it serves to illustrate the idea.

Figure 5 gives an overlayed trellis with some hypothetical costs. The nodes 
are labeled with survivors (within parentheses) after the first phase. Since the 
four codewords have costs 11, 9, 10, 12, the winning path is acefg with a cost of 9. 
The first phase outputs winners with cost 6 at g and 10 at h, corresponding to 
paths bcefg and bcefh. The subtrellis corresponding to the $S_i - f_j$ pair b - g is a 
residual trellis, so we begin decoding at a with estimate 6. At c the estimate changes to 
4+(6-1)=9; at d it is 8+(6-5)=9; at e it is 5+(6-2)=9; at f it is min(9+(6-4), 
7+(6-4))=9; at g it is 9+0=9. Hence the winning path is acefg. 
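
The arithmetic of this example can be replayed with a short Python sketch. The figure is not reproduced here, so the edge costs below are an assumption: they are inferred from the survivor costs, codeword costs and estimates quoted in the text. The names `EDGES`, `ORDER` and `forward_costs` are illustrative only and are not part of the package.

```python
# Edge costs inferred from the numbers quoted in the text (an assumption;
# the figure itself is not available here).
EDGES = {
    ("a", "c"): 4, ("b", "c"): 1,
    ("c", "d"): 4, ("c", "e"): 1,
    ("d", "f"): 1, ("e", "f"): 2,
    ("f", "g"): 2, ("f", "h"): 6,
}
ORDER = "abcdefgh"  # a topological order of the nodes


def forward_costs(starts):
    """Cheapest cost from any of the given start nodes to every reachable node."""
    cost = {s: 0 for s in starts}
    for v in ORDER:
        if v not in cost:
            continue
        for (u, w), c in EDGES.items():
            if u == v:
                cost[w] = min(cost.get(w, float("inf")), cost[v] + c)
    return cost


# First phase: one pass over the overlayed trellis, both start states at cost 0.
first = forward_costs("ab")
# Exact path costs inside the subtrellis that starts at a.
exact = forward_costs("a")
# Second-phase estimate at node v for goal g:
# cost so far + (first-phase cost at g - first-phase survivor cost at v).
estimate = {v: exact[v] + first["g"] - first[v] for v in "acdefg"}
```

Under these assumed edge costs, `first` reproduces the survivors 6 at g and 10 at h, and `estimate` is 6 at a and 9 at every later node of the path, matching the values in the text.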




Fig. 5. Tailbiting trellis with hypothetical edge costs and survivors after first phase 



5 Implementation 

The package has been implemented in C. The minimal trellis construction 
algorithm of Kschischang and Sorokine that is implemented has been validated by 
generating minimal trellises for several codes for which these structures have 
been published, among them the (48,24) quadratic residue code, the (24,12) 
Golay code and the (16,7) lexicode (for which the state complexity profiles are 
available in [19]). 

For the decomposition, subtrellises are generated from a tailbiting trellis by 
carrying out a forward traversal from each start state and keeping track of 
which nodes reachable from the start state also reach the appropriate final 
state. Thus the complexity of subtrellis generation from a tailbiting trellis is 
$O(s)$, where $s$ is the number of states. Additional storage is not required for 
the subtrellises, as each node of the tailbiting trellis has a vector of length $l + 1$ 
associated with it (where $l + 1$ is the number of subtrellises), which indicates 
whether the node is present in a given subtrellis or not. 

For the decoding, the channel is modeled as an AWGN channel using a random 
number generator that produces numbers that are normally distributed with 
mean 0 and a variance determined by $N_0$, the noise energy level. The tail-biting 
trellis is implemented as a two-dimensional array of states representing vertices. 
Each state contains incoming and outgoing branches. Branches contain labels 
that correspond to codeword symbols. If there are $l + 1$ subtrellises, each state 
has storage for $l + 2$ survivor paths (one obtained in the first phase and the 
remaining $l + 1$, one for each individual subtrellis, to be used in the second phase) 
and for the corresponding metrics. The metric along a branch is computed at 
runtime depending on the generated random codeword. Each state contains a 
membership array of length $l + 1$ to indicate its membership in subtrellises. If 
vertex $v$ belongs to subtrellis $i$, then the $i$th bit of the array is set to 1; otherwise 
it is set to 0. The heap required in the second phase is implemented as an array. 
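
The subtrellis generation and the membership bits described above can be sketched as follows. This is a minimal Python illustration, not the package's C code: the function name `subtrellis_membership`, the adjacency-list representation, and the modelling of each start state and its matching final state as two separate nodes are all assumptions made for the example.

```python
def subtrellis_membership(succ, pairs):
    """Return {node: bitmask}; bit i is set when the node lies on some path
    from pairs[i][0] (a start state) to pairs[i][1] (its matching final state).

    succ  -- dict mapping each node to the list of its successors (a DAG)
    pairs -- list of (start, final) pairs, one per subtrellis
    """
    member = {v: 0 for v in succ}
    for i, (start, final) in enumerate(pairs):
        reaches = {}  # memo: does this node reach `final`?

        def visit(v):
            if v not in reaches:
                # Visit every successor (no short-circuit) so that all nodes
                # reachable from the start state get classified.
                hits = [visit(u) for u in succ[v]]
                reaches[v] = (v == final) or any(hits)
            return reaches[v]

        visit(start)
        for v, ok in reaches.items():
            if ok:
                member[v] |= 1 << i
    return member


# Tiny overlayed trellis with start states a, b and final states g, h.
SUCC = {"a": ["c"], "b": ["c"], "c": ["d", "e"], "d": ["f"],
        "e": ["f"], "f": ["g", "h"], "g": [], "h": []}
MEMBER = subtrellis_membership(SUCC, [("a", "g"), ("b", "h")])
```

Each node is visited once per start state, and the interior nodes c, d, e, f end up with both bits set because they are shared by the two subtrellises.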




Fig. 6. Rates of decoding using the Viterbi algorithm and the two phase algorithm for 
the Golay code 



5.1 Simulation Results 

We present the simulation results of the algorithm tested on the extended (24,12) 
Golay code. The rate of the code is $1/2$. The minimal tail-biting trellis of the 
12-section (24,12) Golay code has a uniform state complexity of 16 at all time indices 
and is described in [3]. The conventional 12-section trellis of the (24,12) Golay 
code has the state complexity profile (1,4,16,64,64,256,256,256,64,64,16,4,1). 
The conventional trellis has 1066 states and 2728 branches and the tail-biting 
trellis has 208 states and 384 branches. Each branch of the 12-sectioned Golay 
code represents 2 code bits. 

A large number of random codewords is generated for various signal-to-noise 
ratios, and the average rate of decoding is calculated by running the algorithm on 
the tail-biting trellis. The results are compared with the Viterbi rate of decoding 
on the conventional trellis and are presented in Figure 6. The rate graph 
shows that the two phase algorithm on the tailbiting trellis is significantly faster 
than Viterbi decoding on the conventional trellis, even at a low SNR of 0 dB. While 
the Viterbi rate of decoding remains constant at around 190 codewords/sec for SNR 
values in the range [0, 6] dB, the rate of the proposed algorithm increases steadily 
from 879 to 1181 codewords/sec. It is known [19] that the Viterbi algorithm on a 
trellis with vertex set $V$ and edge set $E$ requires $|E|$ multiplications and $|E| - |V| + 1$ 
additions. From the vertex and edge counts for the conventional and tailbiting 
trellises for the Golay code given above, we conclude that the overheads in heap 
operations and the number of switches are not significant. 

Fig. 7. Probability of a second pass in the two phase algorithm as a function of SNR 
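
The operation counts behind this comparison can be worked out directly from the figures quoted above. The following Python sketch is only an illustration; the helper name `viterbi_ops` is an assumption, and a single Viterbi pass on the tail-biting trellis is used as a rough proxy for the cost of the first phase.

```python
def viterbi_ops(num_vertices, num_edges):
    """Operation counts for one Viterbi pass: |E| multiplications, |E| - |V| + 1 additions."""
    return {"multiplications": num_edges,
            "additions": num_edges - num_vertices + 1}


conventional = viterbi_ops(1066, 2728)   # conventional 12-section Golay trellis
tailbiting = viterbi_ops(208, 384)       # 16-state tail-biting trellis

# Ratio of total operation counts, to compare with the measured speed-up.
ops_ratio = sum(conventional.values()) / sum(tailbiting.values())
measured_low, measured_high = 879 / 190, 1181 / 190
```

The operation counts differ by a factor of a little under 8, while the measured rates differ by a factor of roughly 4.6 at 0 dB and 6.2 at 6 dB, so the measured gain comes close to what the trellis sizes alone would predict.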

It is also seen that for high SNR values, the decoding rarely needs a second 
phase. From Figure 7 we see that the probability that the algorithm requires 
a second phase decreases from 0.652 to 0.012 as the SNR increases from 0 to 6 dB. 
During the second phase the algorithm switches from one subtrellis to another if 
there is a subtrellis on top of the heap with a smaller metric than the subtrellis 
that is currently being expanded. Figure 8 shows that the average number of 
switches in the second pass decreases steadily to 1.53, showing that the search in 
the second phase is usually restricted to a single subtrellis for large SNR values. 
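
The switching rule can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the package's implementation: each residual subtrellis is abstracted as a precomputed, non-decreasing list of estimates (one per search step, the last being the cost on reaching the goal), and the name `second_phase` is hypothetical.

```python
import heapq


def second_phase(estimates):
    """Simulate heap-driven switching between residual subtrellises.

    estimates -- dict mapping a subtrellis id to a non-decreasing list of
                 estimates; the last entry is the cost at the goal node
    Returns (winning subtrellis id, winning cost, number of switches).
    """
    heap = [(seq[0], tid, 0) for tid, seq in estimates.items()]
    heapq.heapify(heap)
    switches = 0
    while heap:
        _, tid, pos = heapq.heappop(heap)
        seq = estimates[tid]
        # Keep searching the current subtrellis while its estimate does not
        # exceed the minimum value stored on the heap.
        while pos + 1 < len(seq) and (not heap or seq[pos] <= heap[0][0]):
            pos += 1
        if pos + 1 == len(seq) and (not heap or seq[pos] <= heap[0][0]):
            return tid, seq[pos], switches  # goal reached on the best subtrellis
        # Otherwise suspend this subtrellis and take up the heap minimum.
        heapq.heappush(heap, (seq[pos], tid, pos))
        switches += 1
    raise ValueError("no residual subtrellis to search")
```

For example, `second_phase({"T1": [6, 9, 9, 9], "T2": [7, 8, 12]})` starts on T1, switches to T2 when T1's estimate rises to 9, and switches back once T2's estimate reaches 12, so T1 wins with cost 9 after two switches.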

The probability of decoding error of a maximum likelihood 
decoder is measured by the Bit Error Rate (BER). For a codeword $C$ of an $(n,k)$ 
code with each codeword symbol requiring $m$ bits, $mn$ bits are transmitted. If 
the decoder decodes to codeword $C'$, whose $mn$ bits differ from $C$ in $e$ locations, 
the bit error rate is $e/(mn)$. Thus, BER is the decoding error per bit. The bit error 
rate for the tail-biting trellis decoding is presented in Figure 9. The probability 
of decoding error is very small for large SNR values, and the curve is consistent 
with others obtained in the literature for the Golay code. 
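
As a small illustration of the decoding error per bit, the following Python sketch counts differing bit positions between transmitted and decoded codewords and divides by the number of transmitted bits. The function names and the representation of codewords as bit strings are assumptions made for the example.

```python
def bit_errors(sent, decoded):
    """Number of positions in which two equal-length bit strings differ."""
    if len(sent) != len(decoded):
        raise ValueError("codewords must have the same number of bits")
    return sum(a != b for a, b in zip(sent, decoded))


def bit_error_rate(pairs):
    """Differing bits divided by transmitted bits, over (sent, decoded) pairs."""
    total_bits = sum(len(sent) for sent, _ in pairs)
    total_errors = sum(bit_errors(s, d) for s, d in pairs)
    return total_errors / total_bits
```

For instance, one wrong bit in two 4-bit words gives `bit_error_rate([("0000", "0100"), ("1111", "1111")])` equal to 0.125.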
Fig. 8. Average number of switches between trellises when there is a second pass 



6 Conclusion 

We have implemented a package for the implementation of block codes as trel- 
lises and an efficient decoding algorithm and simulator for tailbiting trellises. 
Inclusion of a module for the conversion from conventional to tailbiting trellises, 
when the full theory is available, will make this, we hope, a useful general 
purpose tool for the coding community. 

Abstract. We describe an algorithm to deal with automatic error repair 
over unrestricted context-free languages. The method relies on a 
regional least-cost repair strategy with validation, gathering all relevant 
information in the context of the error location. The system guarantees 
the asymptotic equivalence with global repair strategies. 



1 Introduction 

Until recently, errors were simply recovered by the consideration of fiducial 
symbols to provide mile-posts for error recovery, which allows a reduction in time 
and space bounds, although it is not always easy to determine if all relevant 
information to the error recovery process has been seen [8]. The significant 
reduction in processing cost has prompted a renewed interest in methods that 
take into account the constraints on context. We can differentiate [8] two families 
of algorithms. One class, called local repair, makes modifications to the input so 
that at least one more original input symbol can be accepted by the parser. The 
simplicity of these methods sometimes causes them to choose a poor repair. 

In contrast to local techniques, the global repair algorithms examine the entire 
program and make a minimum of changes to repair all the syntax errors, although 
they expend equal effort on all parts of the program, including areas that contain 
no errors. In between the local and global methods, Levi 0, suggested regional 
repair algorithms that fix a portion of the program including the error and as 
many additional symbols as needed to assure a good repair. In relation to global 
and local methods, the regional algorithms must answer the additional question 
of determining just how large a region to repair. 

In addition, when several repairs are available, the system must provide some 
method of choosing among them, and a common strategy is to assign individual 
costs. A repair algorithm that guarantees finding the lowest-cost repair possible 
is called a least-cost repair algorithm. Our proposal is a regional least-cost 
strategy which applies a dynamic validation in order to avoid cascaded errors. 

2 A Dynamic Frame for Parsing 

We introduce our parsing frame, as implemented in Ice 0. Our aim is to 
parse sentences in the language $\mathcal{L}(\mathcal{G})$ generated by a context-free grammar 
$\mathcal{G} = (N, \Sigma, P, S)$, where $N$ is the set of non-terminals, $\Sigma$ the set of terminal 
symbols, $P$ the rules and $S$ the start symbol. The empty string will be 
represented by $\varepsilon$. 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 293 ff., 2001. 
© Springer-Verlag Berlin Heidelberg 2001 



2.1 The Operational Model 

We assume that we produce a push-down automaton (pda) from $\mathcal{G}$. In practice, 
we chose an LALR(1) device, which is possibly non-deterministic, for the language 
$\mathcal{L}(\mathcal{G})$. Formally, a pda is a 7-tuple $\mathcal{A} = (Q, \Sigma, \Delta, \delta, q_0, Z_0, Q_f)$ where: $Q$ is the 
set of states, $\Sigma$ the set of input symbols, $\Delta$ the set of stack symbols, $q_0$ the 
initial state, $Z_0$ the initial stack symbol, $Q_f$ the set of final states, and $\delta$ a finite 
set of transitions of the form $p\,X\,a \mapsto q\,Y$ with $p, q \in Q$, $a \in \Sigma \cup \{\varepsilon\}$ and 
$X, Y \in \Delta \cup \{\varepsilon\}$. Let the pda be in a configuration $(p, X\alpha, ax)$, where $p$ is the 
current state, $X\alpha$ is the stack contents with $X$ on the top, and $ax$ is the remaining 
input, where the symbol $a$ is the next to be shifted and $x \in \Sigma^*$. The application 
of $p\,X\,a \mapsto q\,Y$ results in a new configuration $(q, Y\alpha, x)$ where the terminal 
symbol $a$ has been scanned, $X$ has been popped, and $Y$ has been pushed. 
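
The effect of a single transition on a configuration can be sketched in Python as follows. This is an illustrative reading of the operational model, not code from the system described in the paper; the tuple encoding of transitions, the use of the empty string for the empty symbol, and the name `step` are assumptions.

```python
EPS = ""  # stands for the empty symbol


def step(config, transitions):
    """Return all configurations reachable in one move.

    config      -- (state, stack, remaining_input); stack[0] is the top
    transitions -- set of (p, X, a, q, Y) tuples, read as  p X a -> q Y
    """
    p, stack, inp = config
    result = []
    for (tp, X, a, q, Y) in transitions:
        if tp != p:
            continue
        if X != EPS and (not stack or stack[0] != X):
            continue  # X must be on top of the stack to be popped
        if a != EPS and (not inp or inp[0] != a):
            continue  # a must be the next input symbol
        rest = stack[1:] if X != EPS else stack
        new_stack = ((Y,) if Y != EPS else ()) + rest
        new_inp = inp[1:] if a != EPS else inp
        result.append((q, new_stack, new_inp))
    return result
```

With the single transition `("p", "X", "a", "q", "Y")`, the configuration `("p", ("X", "Z"), "ab")` moves to `("q", ("Y", "Z"), "b")`: the symbol a is scanned, X is popped and Y is pushed.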

The algorithm proceeds by building a collection of items, compact representations 
of the recognizer stacks, by applying transitions to existing ones, until no 
new application is possible. The algorithm associates a set of items $S_i^w$, called 
an itemset, with each input symbol $w_i$ at position $i$ in the input string of length 
$n$, $w_{1..n}$. An item is of the form $[p, X, S_j^w, S_i^w]$, where $p \in Q$, $X \in \Delta \cup \{\varepsilon\}$, 
$S_j^w$ is the back pointer to the itemset associated with the symbol $w_j$ at which we 
began to look for that configuration of the automaton, and $S_i^w$ is the current itemset. 



2.2 The Recognizer 

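Formally, given a transition $\tau = \delta(p, X, a) \ni (q, Y)$, we translate it to items of 
the following form: 

1. $\delta([p, X, S_j^w, S_i^w], a) \ni [q, \varepsilon, S_j^w, S_i^w]$, if $Y = X$ 
2. $\delta([p, X, S_j^w, S_i^w], a) \ni [p, Y, S_i^w, S_{i+1}^w]$, if $Y = a$ 
3. $\delta([p, X, S_j^w, S_i^w], a) \ni [p, Y, S_i^w, S_i^w]$, if $Y \in N$ 
4. $\delta([p, \varepsilon, S_j^w, S_i^w], a) \ni \delta_d([q, \varepsilon, S_l^w, S_j^w], a) \ni [q, \varepsilon, S_l^w, S_i^w]$, 
   $\forall q \in Q$ such that $\exists\, \delta(q, X, \varepsilon) \ni (p, X)$, if $Y = \varepsilon$ 

with $\delta : It \times \Sigma \cup \{\varepsilon\} \to It \cup \delta_d$ and $\delta_d : It \times \Sigma \cup \{\varepsilon\} \to It$, where $It$ is the set 
of all items developed in the parsing process and $\delta_d$ is called the set of dynamic 
transitions. Succinctly, we can describe the preceding cases as follows: 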

1. A goto action from state $p$ to state $q$ under transition $X$. 

2. A push of $a$ from state $p$. The new item belongs to itemset $S_{i+1}^w$. 

3. A push of non-terminal $Y$ from state $p$. 

4. A pop action from state $p$, where $q$ is an ancestor of state $p$ under transition 
$X$. We generate a dynamic transition $\delta_d$ to treat the absence of information 
about the rest of the stack. This transition is applicable not only to the 
configuration resulting from the first one, but also to those yet to be generated 
that share the same syntactic structure. 

Fairness and completeness of the dynamic construction are guaranteed by an 
equitable selection order on $It$. To ignore redundant items we use a subsumption 
relation based on equality. The authors prove in [8] that time and space bounds are, 
in the worst case, $O(n^3)$ and $O(n^2)$ respectively, for inputs of length $n$. 

3 Regional Least-Cost Error Repair 

Following Mauney and Fischer in [3], we talk about the error in a portion of 
the input to mean the difference between what was intended and what actually 
appears in the input. So, we can talk about the point of error as the point at 
which the difference occurs. The point of detection is the point at which the 
parser detects that there is an error in the input and calls the repair algorithm. 

Definition 1. Let $w_{1..n}$ be an input string. We say that $w_i$ is a point of error 
iff $\nexists\, [p, X, S_j^w, S_i^w] \in S_i^w \;/\; \delta(p, X, w_i) \ni (q, w_i)$. 

The point of error is easily fixed by the parser itself and, in order to locate 
the origin of the error at minimal cost, we should try to limit the impact on the 
parse, focusing on the context of subtrees close to the point of error. 

Definition 2. Let $w_i$ be a point of error for the input string $w_{1..n}$. We define 
the set of points of detection associated with $w_i$ as follows: 

$detection(w_i) = \{ w_{i'} \;/\; \exists A \in N,\ A \Rightarrow^{+} w_{i'}\, \alpha\, w_i \}$ 

and we say that $A \Rightarrow^{+} w_{i'}\, \alpha\, w_i$ is a derivation defining the point of detection 
$w_{i'} \in detection(w_i)$. 

Intuitively, the error is located in the immediate left parse context, represented 
by the closest viable node, or in the immediate right context, represented 
by the lookahead. However, it can sometimes be useful to isolate the parse branch 
in which the error appears. 

Definition 3. Let $w_i$ be a point of error for $w_{1..n}$. We say that $[p, X, S_j^w, S_i^w] \in S_i^w$ 
is an error item iff $\exists\, a \in \Sigma,\ \delta(p, \varepsilon, a) \neq \emptyset$, and we say that 
$[p, \varepsilon, S_{i'}^w, S_{i'}^w] \in S_{i'}^w$ is a detection item associated with $w_i$ iff 
$\exists\, a \in \Sigma,\ \delta(p, A, a) \neq \emptyset$, with $A \in N$ defining $w_{i'}$, such that: 

$\delta(q_1, \varepsilon, w_{i'}) \ni (q_1, B_2),\quad \delta(q_1, B_2, w_{i'}) \ni (q_2, \varepsilon)$ 
$\vdots$ 
$\delta(q_{n-1}, \varepsilon, w_{i'}) \ni (q_{n-1}, B_n),\quad \delta(q_{n-1}, B_n, w_{i'}) \ni (q_n, \varepsilon)$ 
$\delta(q_n, \varepsilon, w_{i'}) \ni (q_n, w_{i'}),\quad B_i \Rightarrow^{+} \varepsilon,\ \forall i \in [1, n]$ 

We talk about error and detection items when they represent nodes including 
the recognition of points of error and detection, respectively. The condition 
for error items implies that no scan action is possible for token $w_i$. In the detection 
case, conditions look for items recognizing a point of detection on a parse 
branch including an error item in $w_i$, disregarding empty reductions which are 
not relevant for this purpose. 

Definition 4. A modification $M$ to a string of length $n$, $w_{1..n} = w_1 \ldots w_n$, is a 
series of edit operations, $E_1 \ldots E_n E_{n+1}$, in which each $E_i$ is applied to $w_i$ and 
possibly consists of a series of insertions before $w_i$, replacements or deletion of 
$w_i$. The string resulting from the application of the modification $M$ to the string 
$w$ is written $M(w)$. 

We now restrict the notion of modification to focus on a given zone of the 
input string, introducing the concept of error repair in this space. Intuitively, we 
look for conditions that guarantee the ability to recover the parse from the error, 
at the same time as it allows us to isolate repair branches by using the concept 
of reduction. We are also interested in minimizing the structural impact in the 
parse tree, and finally in introducing the notion of scope as the lowest reduction 
summarizing the process at a point of detection. 

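Definition 5. Let $x$ be a valid prefix in $\mathcal{L}(\mathcal{G})$, and $w \in \Sigma^*$, such that $xw$ is not 
a valid prefix in $\mathcal{L}(\mathcal{G})$. We define a repair of $w$ following $x$ as $M(w)$, so that: 

$$\exists A \in N \;/\; \begin{cases} 
S \Rightarrow^{*} x_{1..i-1} A \Rightarrow^{+} x_{1..i-1}\, x_{i..m} M(w), & i < m \\ 
B \Rightarrow^{+} \alpha A \beta, & \forall B \Rightarrow^{*} x_{j..m} M(w),\ j < i \\ 
A \Rightarrow^{+} \gamma C \rho, & \forall C \Rightarrow^{*} x_{i..m} M(w) 
\end{cases}$$ 

We denote the set of repairs of $w$ following $x$ by $repair(x, w)$, and $A$ by $scope(M)$. 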

However, the notion of repair{x, w) is not sufficient for our purposes, since 
our aim is to extend the error repair process to consider all possible points of 
detection proposed by the algorithm for a given point of error, which implies 
simultaneously considering different valid prefixes and repair zones. 

Definition 6. Let $e \in \Sigma$ be a point of error. We define the set of repairs for $e$ 
as $repair(e) = \{ xM(w) \in repair(x, w) \;/\; w_1 \in detection(e) \}$, where $detection(e)$ 
denotes the set of points of detection associated with $e$. 

We now need a mechanism to filter out undesirable repair processes, in order 
to reduce the computational charges. To do that, we should introduce comparison 
criteria to only select those repairs with minimal cost. 

Definition 7. For each $a \in \Sigma$ we assume the existence of positive insert, $I(a)$; 
delete, $D(a)$; and replace, $R(a)$, costs (if any edit operation is not applied, we 
assume its cost to be zero). The cost of a modification $M(w_{1..n})$, in which delete 
and replace operations are assumed to be exclusive for the same token, is 
given by 

$$cost(M(w_{1..n})) = \sum_{i=1}^{n} \Big( \sum_{j \in J_i} I(a_j) + D(w_i) + R(w_i) \Big),$$ 

where the $a_j$, $j \in J_i$, are the tokens inserted before $w_i$. In particular, 
$\sum_{j \in J_i} I(a_j)$ means that several insertion hypotheses are possible before the token 
$w_i$ is read. 
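
The cost of a modification can be computed position by position, as in the following Python sketch. The encoding of edit operations as `(insertions, action)` pairs and the name `modification_cost` are assumptions made for illustration; following the formula of the definition, a replacement is charged with the replace cost of the token being replaced, and the trailing insertions of the final edit operation are left out for brevity.

```python
def modification_cost(tokens, edits, I, D, R):
    """Cost of a modification, summed position by position.

    tokens  -- the input tokens w_1 .. w_n
    edits   -- one (insertions, action) pair per token; `insertions` is the
               list of tokens inserted before w_i, and `action` is "keep",
               "delete" or "replace" (delete and replace are exclusive)
    I, D, R -- dicts giving the positive insert, delete and replace costs
    """
    total = 0
    for w, (insertions, action) in zip(tokens, edits):
        total += sum(I[a] for a in insertions)  # several insertions may precede w_i
        if action == "delete":
            total += D[w]
        elif action == "replace":
            total += R[w]
        elif action != "keep":
            raise ValueError(f"unknown action: {action}")
    return total
```

An edit operation that is not applied contributes nothing to the sum, which matches the convention that its cost is zero.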



When several repairs are available on different points of detection, we need a 
condition to ensure that only those with the same minimal cost are considered, 
looking for the best repair quality. 



Definition 8. Let $e \in \Sigma$ be a point of error. We define the set of regional repairs 
for $e$ as follows: 

$$regional(e) = \left\{ xM(w) \in repair(e) \;\middle/\; \begin{array}{l} cost(M) \le cost(M'),\ \forall M' \in repair(x, w) \\ cost(M) = \min_{L \in repair(e)} \{cost(L)\} \end{array} \right\}$$ 
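
The selection of regional repairs amounts to keeping, over all points of detection, only the repairs of globally minimal cost. A minimal Python sketch follows; the dictionary layout and the name `regional` are assumptions, and the repairs themselves are represented by plain labels.

```python
def regional(repairs):
    """Keep only the repairs whose cost equals the minimum over all candidates.

    repairs -- dict mapping a point of detection to a list of
               (repair, cost) pairs proposed at that point
    """
    all_pairs = [pair for found in repairs.values() for pair in found]
    if not all_pairs:
        return []
    best = min(cost for _, cost in all_pairs)
    return [repair for repair, cost in all_pairs if cost == best]
```

Several repairs found at different points of detection can survive together, provided they share the same minimal cost.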



It is also necessary to take into account the possibility of cascaded errors, 
that is, errors precipitated by a previous erroneous repair diagnosis. Before 
dealing with the problem, we need to establish the existing relationship between 
the regional repairs for a given point of error, and future points of error. 



Definition 9. Let $w_i$, $w_j$ be points of error in an input string $w_{1..n}$, such that 
$j > i$. We define the set of viable repairs for $w_i$ in $w_j$ as follows: 

$viable(w_i, w_j) = \{ xM(y) \in regional(w_i) \;/\; xM(y) \ldots w_j \text{ is a valid prefix for } \mathcal{L}(\mathcal{G}) \}$ 

Intuitively, the repairs in $viable(w_i, w_j)$ are the only ones capable of ensuring 
the continuity of the parse in $w_{i..j}$ and, therefore, the only possible repairs at 
the origin of the phenomenon of cascaded errors. 

Definition 10. Let $w_i$ be a point of error for the input string $w_{1..n}$. We say 
that a point of error $w_j$, $j > i$, is a point of error precipitated by $w_i$ iff 

$\forall\, xM(y) \in viable(w_i, w_j),\ \exists A \in N$ defining $w_{j'} \in detection(w_j)$ 
such that $A \Rightarrow^{+} \beta\, scope(M) \ldots w_j$. 

Intuitively, a point of error $w_j$ is precipitated by the result of previous repairs 
on a point of error $w_i$ when all reductions defining points of detection for $w_j$ 
summarize some viable repair for $w_i$ in $w_j$. 



4 The Algorithm 

We propose that the repair be obtained by searching the pda itself to find a 
suitable configuration to allow the parse to continue. At this point, our approach 
agrees with that of McKenzie et al., although their method is not asymptotically 
equivalent to a global repair strategy, and introduces an unsafe technique to 
speed up the repair algorithm. In relation to this last point, McKenzie et al. propose 
a pruning mechanism in order to reduce the number of configurations to be dealt 
with during the repair process. This mechanism may lead to suboptimal regional 
repairs or may cause failure to produce any repair even if an error exists. 

Fig. 1. Error detection 

The problem due to pruning is based on a simple condition that ignores a 
stack configuration if an earlier one had the same stack top. The motivation is 
that this newer configuration would not lead to any cheaper repairs than the 
older one. Our dynamic programming construction eliminates this problem by 
considering all possible repair paths. 



4.1 A Simple Case 

We assume that we deal with the first error detected in the input string. The 
algorithm begins with a list of error items, each with 
an error counter of zero. In order to compute the error counter, we extend the item 
structure to $[p, X, S_j^w, S_i^w, e]$, where now $e$ is the error counter accumulated in the 
recognition of $X \in N \cup \Sigma$. 

For each error item, we successively investigate the corresponding list of 
detection items, one for each parse branch including the error item. Once a point 
of error $w_i$ has been fixed, we can associate with it different points of detection 
$w_{i'_1}, \ldots, w_{i'_k}$, as shown in Fig. 1. Detection items are located by using the back 
pointer, which indicates the itemset where we have applied the last pda action. 
So, we recursively go back through its ancestors until we find the first descendant of 
the last node that would have to be reduced if the lookahead were correct. 

Once the detection items have been fixed for the corresponding error item, 
on each of the parse branches relying on them we apply all possible transitions 
beginning at the point of detection. These transitions correspond to four error 
hypotheses, from a given item: 



— For scan transitions the item obtained is the same as for standard parsing. 

$[p, \varepsilon, S_j^w, S_i^w, 0] \longmapsto [p, w_i, S_i^w, S_{i+1}^w, 0]$ 

(Footnote to the location of detection items: this information is directly obtained from the pda.) 

— In the case of insertion hypothesis, we initialize the error counter by taking 
into account the cost of the inserted token, which is included as stack symbol, 
and we add the new item to the same itemset, preserving the back pointer. 

$[p, \varepsilon, S_j^w, S_i^w, 0] \xrightarrow{\text{insert } a} [p, a, S_j^w, S_i^w, I(a)],\quad \delta(p, \varepsilon, a) \neq \emptyset$ 



— For deletion hypothesis, we initialize the error counter by taking into account 
the cost of the deleted token and we add the new item to the next itemset, 
using the same stack symbol. The back pointer is initialized to the current 
itemset. 



$[p, \varepsilon, S_j^w, S_i^w, 0] \xrightarrow{\text{delete } w_i} [p, \varepsilon, S_i^w, S_{i+1}^w, D(w_i)]$ 



— Finally, for mutation hypothesis, we initialize the error counter by taking 
into account the cost of the replaced token and we add the new item to the 
next itemset. The back pointer is initialized to the current itemset, and the 
new token resulting from the mutation is included as stack symbol. 



$[p, \varepsilon, S_j^w, S_i^w, 0] \xrightarrow{\text{replace } w_i \text{ by } a} [p, a, S_i^w, S_{i+1}^w, R(a)],\quad \delta(p, \varepsilon, a) \neq \emptyset$ 



We do that until a reduction verifying Definition 5 covers both error and detection 
items, accepting a token in the remaining input string, as shown in Fig. 2, 
where $[w_{i'}, w_{i''}]$ delimits the scope of a repair detected at the point 
$w_{i'} \in detection(w_i)$. Once we have applied the previous methodology to each detection 
item considered, we take only those repairs with regional lowest cost, applying 
Definition 8. At this moment the parse goes back to standard mode. 
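
The four error hypotheses can be sketched as a function that produces the new items from a given item. This is an illustrative Python reading of the rules above, not the authors' implementation: the `Item` record, the use of integer itemset indices for back pointers, the `scannable` argument (tokens for which the pda has a scan action in the current state) and the choice not to replace a token by itself are all assumptions. The replacement is charged with the replace cost of the new token, as in the mutation rule.

```python
from typing import NamedTuple


class Item(NamedTuple):
    state: str
    symbol: str    # stack symbol; "" stands for the empty symbol
    back: int      # index of the back-pointer itemset
    current: int   # index of the current itemset
    errors: int    # accumulated error counter


def hypotheses(item, token, scannable, I, D, R):
    """Items produced from `item` = [p, eps, S_j, S_i, 0] when `token` = w_i is next."""
    p, _, j, i, _ = item
    out = []
    if token in scannable:                          # scan: as in standard parsing
        out.append(Item(p, token, i, i + 1, 0))
    for a in sorted(scannable):
        out.append(Item(p, a, j, i, I[a]))          # insert a: same itemset, same back pointer
        if a != token:
            out.append(Item(p, a, i, i + 1, R[a]))  # replace w_i by a: next itemset
    out.append(Item(p, "", i, i + 1, D[token]))     # delete w_i: same stack symbol
    return out
```

Each hypothesis starts its own error counter; the counters are only summed later, at reduce time.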

We use a bottom-up strategy not only to parse the input, but also to compute 
error counters. This establishes a difference with McKenzie et al., who use 
a bottom-up parsing architecture with a top-down computation of the error 
counters. An inheritance strategy to compute the error counters makes the task 
of propagating these counters on shared parse branches a complex one. In our 
case, error counters are initialized at each error hypothesis and summarized only 
at reduce action time. So, dynamic transitions must include information about 
the accumulated error counter in the part of the reduce action to be shared. The 
process is illustrated in Fig. 3 for two reductions, $A_i$ and $A_j$, over the same rule 
$A \to X_1 \ldots X_m$ sharing the last $X_{k+1} \ldots X_m$ syntactic categories. We re-take the 
part of the error counter accumulated during the first reduction, $e_{k+1} + \ldots + e_m$, 
for these common categories. 

Fig. 3. Dynamic transitions in repair mode 



4.2 The General Case 

We now assume that the current repair process is not the first one and, therefore, 
can modify a previously repaired string. This arises when we realize that we come 
back to a detection item for which any parse branch includes a previous repair 
process. This process is illustrated in Fig. 4 for a point of error $w_j$ precipitated 
by $w_i$, showing how the variable $A_{j'}$ defining $w_{j'}$ summarizes the scope 
of a previous repair defined by $A_{i'}$. 



To deal with precipitated errors, the algorithm re-takes the previous error 
counters, adding the cost of the new error repair hypothesis to profit from the 
experience gained from previous repair processes. At this point, regional repairs 
have two important properties. First, they are independent of the shift-reduce parsing 
algorithm used. The second property is a consequence of the lemma below. 

Lemma 1. (The Expansion Lemma) Let $w_i$, $w_j$ be points of error in $w_{1..n} \in \Sigma^*$, 
such that $w_j$ is precipitated by $w_i$, then 

$\min\{ j' \;/\; w_{j'} \in detection(w_j) \} < \min\{ i' \;/\; w_{i'} = y_1,\ xM(y) \in viable(w_i, w_j) \}$ 

Proof. Let $w_{i'} \in \Sigma$, such that $w_{i'} = y_1$, $xM(y) \in viable(w_i, w_j)$, be a point 
of detection for $w_i$, for which some parsing branch derived from a repair in 
$regional(w_i)$ has successfully arrived at $w_j$. 
Let $w_j$ be a point of error precipitated by $xM(y) \in viable(w_i, w_j)$. By definition, 
we can assure that 

$\exists B \in N \;/\; B \Rightarrow^{+} w_{j'}\, \alpha\, w_j \Rightarrow^{+} \beta\, scope(M) \ldots w_j \Rightarrow^{+} \beta\, x_{l..m} M(y) \ldots w_j,\quad w_{i'} = y_1$ 

Given that $scope(M)$ is the lowest variable summarizing $w_{i'}$, it immediately 
follows that $j' < i'$, and we conclude the proof by extending the argument to all 
repairs in $viable(w_i, w_j)$. □ 

Fig. 4. Dealing with precipitated errors 

Corollary 1. Let $w_i$, $w_j$ be points of error in $w_{1..n} \in \Sigma^*$, such that $w_j$ is 
precipitated by $w_i$, then 

$\max\{ scope(M),\ M \in viable(w_i, w_j) \} \subset \max\{ scope(M),\ M \in regional(w_j) \}$ 

Proof. It immediately follows from Lemma 1. □ 

This allows us to get an asymptotic behavior close to that of global repair methods. 
This property has profound implications for the efficiency, measured by time and 
space taken, the simplicity and the power of computing regional repairs. 

Corollary 2. Let $w_{1..n}$ be an input string with a point of error in $w_i$, $i \in [1, n]$, 
then the time and space bounds for the regional repair algorithm are $O(n^3)$ and 
$O(n^2)$, in the worst case, respectively. 

Proof. It immediately follows from Corollary 1. □ 

5 Conclusions 

To improve the quality of repairs we should gather information to the right and 
to the left of the point of detection as long as this information could possibly be 
relevant. A criterion that meets our requirements is to expand the repair mode 
until it is guaranteed to accept the next input symbol, but maintains the chance 
of reconsidering the process once the system has detected that an incorrect repair 
assumption has been made. 

Abstract. This paper uses parameterized complexity analysis to delimit 
possible non-polynomial time algorithmic behaviors for the finite-state 
acceptor intersection and finite-state transducer intersection and composition 
problems. One important result derived as part of these analyses 
is the first proof of the NP-hardness of the finite-state transducer composition 
problem for both general and p-subsequential transducers. 



1 Introduction 

Certain applications of finite-state automata are most naturally stated in terms 
of the intersection or composition of a set of automata jTITH] . One approach 
to solving these problems is to use state Cartesian product constructions 
0 pp. 59-60] to build the automaton associated with the intersection or compo- 
sition and then answer the query relative to a determinized and/or minimized 
version of that automaton. Though such queries as emptiness or membership 
can typically be answered in time and space linear in the size of the derived 
automaton, the automaton may have $O(|Q|^{|A|})$ states, where $|A|$ is the number 
of automata in the set and $|Q|$ is the maximum number of states in any automaton 
in the set. This is to be expected, as many problems on sets of automata are 
NP-hard and hence do not have polynomial-time algorithms unless $P = NP$. 
However, are there other non-polynomial time algorithmic options for solving 
such problems, e.g., an algorithm whose non-polynomial time complexity term is 
purely a function of $|Q|$ and $|\Sigma|$, where $|\Sigma|$ is the size of the language-alphabet? 
Knowledge of such options would be useful in practice for choosing the most 
efficient algorithm in situations in which one or more of the characteristics of 
the problem are of bounded value, e.g., $|Q| \le 4$ and $|\Sigma| \le 26$. 
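
The state Cartesian product construction can be sketched for a bounded-length question of the kind studied later in the paper: is there a string of a given length accepted by every DFA in a set? The Python below is an illustration only; the triple encoding of a DFA, the treatment of missing transitions as rejection, and the name `accepts_common_string` are assumptions.

```python
def accepts_common_string(dfas, alphabet, k):
    """Is there a string of length k accepted by every DFA in `dfas`?

    Each DFA is a triple (start, finals, delta), where delta is a dict
    mapping (state, symbol) to the next state; missing entries reject.
    States are assumed never to be None.
    """
    # A product state is a tuple holding one state per automaton.
    frontier = {tuple(start for start, _, _ in dfas)}
    for _ in range(k):
        nxt = set()
        for states in frontier:
            for sym in alphabet:
                moved = tuple(d[2].get((q, sym)) for d, q in zip(dfas, states))
                if None not in moved:
                    nxt.add(moved)
        frontier = nxt
    return any(all(q in d[1] for d, q in zip(dfas, states)) for states in frontier)
```

The set `frontier` never holds more product states than there are tuples of component states, which is where the exponential dependence on the number of automata comes from.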

In this paper, techniques from the theory of parameterized computational 
complexity ^ are used to determine part of the range of possible non-polynomial 
time algorithmic behaviors for the finite-state acceptor intersection and finite- 
state transducer intersection and composition problems. These analyses gener- 
alize and simplify results given in m- One important result derived as part of 
these analyses is the first proof of the NP-hardness of the finite-state transducer 
composition problem for both general and p-subsequential transducers. 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 302 ff., 2001. 

© Springer-Verlag Berlin Heidelberg 2001 

1.1 Terminology 

A finite state acceptor (FSA) is a 5-tuple $(Q, \Sigma, \delta, s, F)$ where $Q$ is a set of 
states, $\Sigma$ is an alphabet, $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \times Q$ is a transition relation, $s \in Q$ is 
the start state, and $F \subseteq Q$ is a set of final states. If $\delta$ has no entries of the form 
$(q, \varepsilon, q')$ and is also a function, i.e., for each $q \in Q$ and $s \in \Sigma$ there is at most one 
state $q' \in Q$ such that $(q, s, q') \in \delta$, the FSA is a deterministic finite-state 
acceptor (DFA). 
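
The determinism condition can be checked mechanically on a transition relation given as a set of triples. The following Python sketch is illustrative; the name `is_deterministic` and the use of the empty string as the label of an $\varepsilon$-transition are assumptions.

```python
EPS = ""  # label used here for an epsilon transition


def is_deterministic(delta):
    """delta is a set of (q, s, q2) triples; True when the FSA is a DFA."""
    seen = set()
    for (q, s, _q2) in delta:
        # Reject epsilon entries and any second target for the same (q, s).
        if s == EPS or (q, s) in seen:
            return False
        seen.add((q, s))
    return True
```

Because `delta` is a set, two triples sharing the same state and symbol necessarily differ in their target state, which is exactly the case the definition excludes.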

A finite state transducer (FST) is a 6-tuple $(Q, \Sigma_i, \Sigma_o, \delta, s, F)$ where $Q$ 
is a set of states, $\Sigma_i$ and $\Sigma_o$ are the input and output alphabets, respectively, 
$\delta : Q \times \Sigma_i^* \times \Sigma_o^* \times Q$ is a transition relation, $s \in Q$ is the start state, and $F \subseteq Q$ 
is a set of final states. There are several possible definitions of determinism for 
FST; types of interest here are: 

— i-Deterministic FST (sequential FST): For each $q \in Q$ and $x \in \Sigma_i^*$, 
there is at most one $y \in \Sigma_o^*$ and $q' \in Q$ such that $(q, x, y, q') \in \delta$. 

— i/o-Deterministic FST: For each $q \in Q$, $x \in \Sigma_i^*$ and $y \in \Sigma_o^*$, there is at 
most one $q' \in Q$ such that $(q, x, y, q') \in \delta$. 

All FST in this paper are restricted to singleton labels that are $\varepsilon$-free, i.e., 
$\delta : Q \times \Sigma_i \times \Sigma_o \times Q$. Note that such FST will always produce output strings of 
the same length as the input string [3, Lemma 3.3]. 
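
The two notions of determinism differ only in what is required to be unique, as the following Python sketch shows. It is an illustration under the assumption that the transition relation is given as a set of `(q, x, y, q2)` tuples; the function names are hypothetical.

```python
def is_i_deterministic(delta):
    """At most one (y, q2) for each state q and input label x."""
    seen = set()
    for (q, x, _y, _q2) in delta:
        if (q, x) in seen:
            return False
        seen.add((q, x))
    return True


def is_io_deterministic(delta):
    """At most one q2 for each state q, input label x and output label y."""
    seen = set()
    for (q, x, y, _q2) in delta:
        if (q, x, y) in seen:
            return False
        seen.add((q, x, y))
    return True
```

A transducer that maps the same input symbol to two different output symbols from one state is i/o-deterministic but not i-deterministic, so i-determinism is the stronger condition.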

2 Parameterized Complexity Analysis 

The theory of NP-completeness proposes a class NP of decision problems 
that is conjectured to properly include the class P of decision problems that have 
polynomial-time algorithms. For a given decision problem $\Pi$, if every problem in 
NP reduces to $\Pi$, i.e., $\Pi$ is NP-hard, then $\Pi$ does not have a polynomial-time 
algorithm unless $P = NP$. 

It may still be possible to solve NP-hard problems by invoking 
non-polynomial time algorithms that are effectively polynomial 
time because their non-polynomial terms are purely functions of 
sets of aspects of the problems that are of bounded size or value 
in instances of those problems encountered in practice, where an 
aspect of a problem is some (usually numerical) characteristic that can be 
derived from instances of that problem, i.e., $|Q|$, $|\Sigma|$, and $|A|$ in the case of 
finite-state automaton intersection and composition problems. The theory of 
parameterized computational complexity provides explicit mechanisms for 
analyzing the effects of both individual aspects and sets of aspects on problem 
complexity. 

Footnote: Given decision problems $\Pi$ and $\Pi'$, $\Pi$ reduces to $\Pi'$, i.e., $\Pi \le_m \Pi'$, if there is an 
algorithm $A$ that transforms an instance $x$ of $\Pi$ into an instance $y$ of $\Pi'$ such that 
$A$ runs in time polynomial in the size of $x$ and $x$ has a solution if and only if $y$ has 
a solution, i.e., $x \in \Pi$ if and only if $y \in \Pi'$. 

Definition 1. A parameterized problem $\Pi \subseteq \Sigma^* \times \Sigma^*$ has instances of the 
form $(x, y)$, where $x$ is called the main part and $y$ is called the parameter. 

Definition 2. A parameterized problem $\Pi$ is fixed-parameter tractable if 
there exists an algorithm $A$ to determine if instance $(x, y)$ is in $\Pi$ in time 
$f(y) \cdot |x|^{\alpha}$, where $f : \Sigma^* \to \mathbb{N}$ is an arbitrary function and $\alpha$ is a constant 
independent of $x$ and $y$. 

Given a decision problem $\Pi$ with a parameter $p$, let $\langle p \rangle$-$\Pi$ denote the
parameterized problem associated with $\Pi$ that is based on parameter $p$ and let
$\langle p_c \rangle$-$\Pi$ denote the subproblem of $\langle p \rangle$-$\Pi$ in which $p$ has value $c$ for some constant
$c > 0$. One can establish that a parameterized problem $\Pi$ is not fixed-parameter
tractable by using a parametric reduction^ to show that $\Pi$ is hard for any of the
classes of the W-hierarchy = {FPT, W[1], W[2], ..., W[P], XP} except FPT,
where FPT is the class of fixed-parameter tractable parameterized problems
(see 0 for details). These classes are related as follows:

$FPT \subseteq W[1] \subseteq W[2] \subseteq \cdots \subseteq W[P] \subseteq \cdots \subseteq XP$

If a parameterized problem is C-hard for any class C in the W-hierarchy above
FPT then that problem is not fixed-parameter tractable unless FPT = C.

The following lemmas will be used in the analyses given in the next section. 

Lemma 3. [16, Lemma 2.1.25] Given decision problems $\Pi$ and $\Pi'$ with
parameters $p$ and $p'$, respectively, if $\Pi \leq_m \Pi'$ such that $p' = g(p)$ for an
arbitrary function $g$, then $\langle p \rangle$-$\Pi$ parametrically reduces to $\langle p' \rangle$-$\Pi'$.

Lemma 4. [16, Lemma 2.1.35] Given a set $S$ of aspects of a decision problem
$\Pi$, if $\Pi$ is NP-hard when the value of every aspect $s \in S$ is fixed, then the
parameterized problem $\langle S \rangle$-$\Pi$ is not in XP unless P = NP.

3 Results 

The analyses in this section will focus on the following three problems: 
Bounded DFA Intersection (BDFAI) 

Instance: A set $A$ of DFA over an alphabet $\Sigma$ and a positive integer $k$.
Question: Is there a string $x \in \Sigma^k$ that is accepted by each DFA in $A$?

i/o-DETERMINISTIC FST INTERSECTION (FST-I)

Instance: A set $A$ of i/o-deterministic FST, all of whose input and output
alphabets are $\Sigma_i$ and $\Sigma_o$, respectively, and a string $u \in \Sigma_i^+$.
Question: Is there a string $s \in \Sigma_o^{|u|}$ such that the string-pair $u/s$ is accepted
by each FST in $A$?

^ Given parameterized problems $\Pi$ and $\Pi'$, $\Pi$ parametrically reduces to $\Pi'$ if
there is an algorithm $A$ that transforms an instance $(x, y)$ of $\Pi$ into an instance
$(x', y')$ of $\Pi'$ such that $A$ runs in $f(y)|x|^{\alpha}$ time for an arbitrary function $f$ and
a constant $\alpha$ independent of both $x$ and $y$, $y' = g(y)$ for some arbitrary function $g$,
and $(x,y) \in \Pi$ if and only if $(x',y') \in \Pi'$.
i/o-DETERMINISTIC FST COMPOSITION (FST-C)

Instance: A set $A$ of i/o-deterministic FST, all of whose input and output
alphabets are $\Sigma$, a composition-order $O$ on these FST, and a string $u \in \Sigma^+$.
Question: Is there a sequence of strings $\{s_0, s_1, \ldots, s_{|A|}\}$ with $s_0 = u$ and
$s_i \in \Sigma^{|u|}$ for $1 \leq i \leq |A|$ such that for the ordering $\{a_1, a_2, \ldots, a_{|A|}\}$ of $A$ under
$O$ and $1 \leq i \leq |A|$, $s_{i-1}/s_i$ is accepted by $a_i$?

The version of problem BDFAI in which $x \in \Sigma^*$ is PSPACE-complete 0 and
problem FST-I is NP-hard by a slight modification of the reduction given in
0 Section 5.5.1]. All parameterized complexity analyses given in this section will
be done relative to the following aspects: the number of finite-state automata
in $A$ ($|A|$), the required length of the result-string ($k$ in the case of BDFAI,
$|u|$ in the case of FST-I and FST-C), the maximum number of states of any
finite-state automaton in $A$ ($|Q|$), and the size of the alphabet ($|\Sigma|$ in the case
of BDFAI and FST-C, $|\Sigma_i|$ and $|\Sigma_o|$ in the case of FST-I).



3.1 Bounded DFA Intersection 

Hardness results will be derived via reductions from the following problems: 

Longest common subsequence (LCS) jSJ Problem SRIO] 

Instance: A set of strings $X = \{x_1, \ldots, x_k\}$ over an alphabet $\Sigma$, an integer $m$.
Question: Is there a string $y \in \Sigma^m$ that is a subsequence of $x_i$ for $i = 1, \ldots, k$?

Dominating set 0 Problem GT2] 

Instance: A graph $G = (V,E)$, an integer $k$.
Question: Is there a set of vertices $V' \subseteq V$, $|V'| \leq k$, such that each vertex in
$V$ is either in $V'$ or adjacent to a vertex in $V'$?

Note that all reductions below are phrased in terms of BDFAI$_R$, the restricted
version of BDFAI in which $k \leq |Q|$.

Lemma 5. LCS $\leq_m$ BDFAI$_R$.

Proof. Given an instance $(X, k, \Sigma, m)$ of LCS, construct the following instance
$(A', \Sigma', k')$ of BDFAI$_R$: Let $\Sigma' = \Sigma$, $k' = m$, and $A'$ be created by applying to
each string $x \in X$ the algorithm in ^ which produces a DFA on $|x|+1$ states that
recognizes all subsequences of a string $x$. Note that in the constructed instance
of BDFAI$_R$, $|A'| = k$, $k' = m$, and $|\Sigma'| = |\Sigma|$; moreover, $k' = m \leq |Q|$. □
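
The subsequence-recognizing DFA used in the proof of Lemma 5 can be illustrated with a minimal Python sketch. This is not necessarily the algorithm of the cited reference: it assumes that a DFA is stored as a partial transition dictionary in which a missing entry means rejection, that every state is accepting, and the names `subsequence_dfa` and `accepts` are hypothetical.

```python
def subsequence_dfa(x):
    """Build a partial DFA on states 0..len(x) accepting every subsequence of x.

    State i means "the next symbol must be matched at index >= i of x".
    All states are accepting; a missing transition means rejection.
    """
    delta = {}
    next_pos = {}  # symbol -> (index of its next occurrence at or after i) + 1
    for i in range(len(x) - 1, -1, -1):
        next_pos[x[i]] = i + 1
        for symbol, target in next_pos.items():
            delta[(i, symbol)] = target
    return delta


def accepts(delta, word, start=0):
    """Run the partial DFA; every reachable state is accepting."""
    state = start
    for symbol in word:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return True


dfa = subsequence_dfa("abcab")
print(accepts(dfa, "acb"), accepts(dfa, "cc"))
```

The automaton has exactly $|x|+1$ states, which matches the state bound quoted in the proof.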

Lemma 6. Dominating set $\leq_m$ BDFAI$_R$.

Proof. Given an instance $(G = (V,E), k)$ of Dominating set, construct the
following instance $(A', \Sigma', k')$ of BDFAI$_R$: Let $\Sigma' = V$ be an alphabet such that
each vertex $v \in V$ has a distinct corresponding symbol in $\Sigma'$, and let $k' = k$.
For each $v \in V$, let $adj(v)$ be the set of vertices in $V$ that are adjacent to $v$
in $G$ (including $v$ itself) and $nonadj(v) = V - adj(v)$. For each vertex $v \in V$,
construct a two-state DFA $A_v = (\{q_1, q_2\}, \Sigma', \delta, q_1, \{q_2\})$ with transition relation
$\delta = \{(q_1,v',q_1) \mid v' \in nonadj(v)\} \cup \{(q_1,v',q_2) \mid v' \in adj(v)\} \cup \{(q_2,v',q_2) \mid v' \in V\}$.
Let $A'$ be the set consisting of all DFA $A_v$ corresponding to vertices
$v \in V$ plus the $(k+1)$-state DFA that recognizes all strings in $\Sigma'^k$. Note that
in the constructed instance of BDFAI$_R$, $|A'| = |\Sigma'| + 1 = |V| + 1$, $k' = k$, and
$|Q| = k + 1$; moreover, $k' = k \leq |V|$ and $k' = k \leq |Q|$. □
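
The reduction in Lemma 6 can be sketched in Python as follows. This is an illustration under stated assumptions, not the author's implementation: DFAs are triples `(delta, start, finals)`, the length restriction is enforced by enumerating only strings of length k rather than by building the separate (k+1)-state length DFA, and all function names are hypothetical.

```python
from itertools import product


def dominating_set_dfas(vertices, edges):
    """One two-state DFA per vertex v: state 2 is reached once a symbol in adj(v) is read."""
    adj = {v: {v} for v in vertices}  # adj(v) includes v itself
    for u, w in edges:
        adj[u].add(w)
        adj[w].add(u)
    dfas = []
    for v in vertices:
        delta = {}
        for symbol in vertices:
            delta[(1, symbol)] = 2 if symbol in adj[v] else 1
            delta[(2, symbol)] = 2
        dfas.append((delta, 1, {2}))
    return dfas


def run(dfa, word):
    delta, state, finals = dfa
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals


def has_common_string(dfas, alphabet, k):
    """Brute-force check over all |alphabet|^k strings of length k."""
    return any(all(run(d, w) for d in dfas) for w in product(alphabet, repeat=k))


path = [1, 2, 3, 4, 5]
dfas = dominating_set_dfas(path, [(1, 2), (2, 3), (3, 4), (4, 5)])
print(has_common_string(dfas, path, 1), has_common_string(dfas, path, 2))
```

A string of length k accepted by every $A_v$ names at most k distinct vertices that together dominate the graph, which is why repeated symbols are harmless.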

Lemma 7. BDFAI$_R$ $\leq_m$ BDFAI$_R$ such that $|\Sigma| = 2$.

Proof. Given an instance $(A, \Sigma, k)$ of BDFAI$_R$, construct the following instance
$(A', \Sigma', k')$ of BDFAI$_R$: Let $\Sigma' = \{0, 1\}$ and assign each symbol in $\Sigma$ a binary
codeword of fixed length $\ell = \lceil \log |\Sigma| \rceil$. For each DFA $a \in A$, create a DFA $a' \in A'$
by adjusting $Q$ and $\delta$ such that each state $q$ and its outgoing transitions in $a$ is
replaced with a "decoding tree" on $2^{\ell} - 1$ states in $a'$ that uses $\ell$ bits to connect
$q$ to the appropriate states. Finally, let $k' = k\ell$. Note that in the constructed
instance of BDFAI$_R$, $|A'| = |A|$ and $|\Sigma'| = 2$; moreover, as $(|Q| + 1)(|\Sigma| - 1) \leq |Q'|$
and $k \leq |Q|$, $k' = k\ell \leq |Q| \lceil \log |\Sigma| \rceil \leq (|Q| + 1)(|\Sigma| - 1) \leq |Q'|$. □

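Theorem 8.

1. BDFAI$_R$ is NP-hard when $|\Sigma| = 2$.

2. $\langle k, |\Sigma| \rangle$-BDFAI$_R$ is in FPT.

3. $\langle |A|, |Q| \rangle$-BDFAI$_R$ is in FPT.

4. $\langle |Q|, |\Sigma| \rangle$-BDFAI$_R$ is in FPT.

5. $\langle |A|, k \rangle$-BDFAI$_R$ is W[1]-hard.

6. $\langle k, |Q| \rangle$-BDFAI$_R$ is W[2]-hard.

7. $\langle |A|, |\Sigma|_2 \rangle$-BDFAI$_R$ is W[t]-hard for all $t \geq 1$.

8. $\langle |\Sigma| \rangle$-BDFAI$_R$ $\notin$ XP unless P = NP.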

Proof of (1): Follows from the NP-hardness of LCS 0 Problem SR10], the
reduction in Lemma 5 from LCS to BDFAI$_R$, and the reduction in Lemma 7
from BDFAI$_R$ to BDFAI$_R$ in which $|\Sigma| = 2$.

Proof of (2): Follows from the algorithm that generates all $|\Sigma|^k$ possible $k$-length
strings over alphabet $\Sigma$ and checks each string in $O(|A|k)$ time to see whether
that string is accepted by each of the DFA in $A$. The algorithm as a whole runs
in $O(|\Sigma|^k k |A|)$ time, which is fixed-parameter tractable relative to $k$ and $|\Sigma|$.
Proof of (3): Follows from the algorithm that constructs the intersection DFA
of all DFA in $A$ and the $(k+1)$-state DFA that recognizes all strings in $\Sigma^k$,
and then applies depth-first search to the transition diagram for this
intersection DFA to determine if any of its final states are reachable from its
start state. The intersection DFA can be created in $O(|Q|^{|A|+1}(k+1)|\Sigma|^2) =
O(|Q|^{|A|+1} 2k |\Sigma|^2) = O(|Q|^{|A|+1} k |\Sigma|^2)$ time. As the graph $G = (V,E)$
associated with the transition diagram of this DFA has $|V| \leq (k+1)|Q|^{|A|} \leq 2k|Q|^{|A|}$
states and $|E| \leq (k+1)|Q|^{|A|}|\Sigma| \leq 2k|Q|^{|A|}|\Sigma|$ arcs and depth-first search runs
in $O(|V| + |E|)$ time, the algorithm as a whole runs in $O(|Q|^{|A|+1} k |\Sigma|^2)$ time,
which is fixed-parameter tractable relative to $|A|$ and $|Q|$.

Proof of (4). This result follows from the observation that there are at most 
|q|I-S||QI X 2l'^l < i/o-deterministic FST for any choice of \Q\ and 
li^l. Hence, the number of different FST in any set A is at most 1(51^1^11*51+^.
This suggests the algorithm that removes all redundant FST in A and then per- 
forms the algorithm given in part (3) above. The first step involves checking 
the isomorphism of the transition diagrams of all FST in A, where each tran- 
sition diagram has at most \Q\ vertices and at most |(5p|T'| edges, which can 
be done in 0(((|H|(|H| -l)/2)|Q|IOI|Q|2|r|) = (9((|H|2|g|l'5l+2|r|) time. Hence, 
the algorithm as a whole runs in (9(|Q|*'I‘5I ' time, which is 

fixed-parameter tractable relative to |Q| and iFil. 

Proof of (5): Follows from the W[1]-completeness of $\langle k, m \rangle$-LCS 0, the reduction
in Lemma 5 from LCS to BDFAI$_R$ in which $|A'| = k$ and $k' = m$, and Lemma 3.

Proof of (6): Follows from the W[2]-completeness of $\langle k \rangle$-Dominating set 0,
the reduction in Lemma 6 from Dominating set to BDFAI$_R$ in which $k' = k$
and $|Q| = k + 1$, and Lemma 3.

Proof of (7): Follows from the W[t]-hardness of $\langle k \rangle$-LCS for $t \geq 1$ 0, the
reduction in Lemma 5 from LCS to BDFAI$_R$ in which $|A'| = k$, the reduction
in Lemma 7 from BDFAI$_R$ to BDFAI$_R$ in which $|A'| = |A|$ and $|\Sigma'| = 2$,
and Lemma 3.

Proof of (8): Follows from (1) and Lemma 4. □

As BDFAI$_R$ is a restriction of BDFAI, all hardness results above also hold for
BDFAI. However, as the value of $k$ is not necessarily bounded by a polynomial
in the instance size in BDFAI, the algorithm for part (3) only works (and hence
results (3) and (4) only hold) for BDFAI if $k$ is also included in the parameter.
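
The product-construction idea behind the proof of part (3) can be sketched as a layered search over tuples of states. This is a hedged illustration, not the exact algorithm analysed above: instead of materialising the intersection DFA and the (k+1)-state length DFA, the sketch keeps the set of product states reachable after exactly i symbols, which explores the same state space. The representation `(delta, start, finals)` with partial `delta` and the name `bdfai_product` are assumptions.

```python
def bdfai_product(dfas, alphabet, k):
    """Return True iff some string of length exactly k is accepted by every DFA."""
    frontier = {tuple(start for (_, start, _) in dfas)}
    for _ in range(k):
        next_frontier = set()
        for product_state in frontier:
            for symbol in alphabet:
                step = []
                for (delta, _, _), q in zip(dfas, product_state):
                    if (q, symbol) not in delta:
                        break  # one component rejects: drop this branch
                    step.append(delta[(q, symbol)])
                else:
                    next_frontier.add(tuple(step))
        frontier = next_frontier
    return any(
        all(q in finals for (_, _, finals), q in zip(dfas, product_state))
        for product_state in frontier
    )


even_a = ({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ends_b = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})
print(bdfai_product([even_a, ends_b], "ab", 3))
```

Each layer holds at most $|Q|^{|A|}$ product states, so the running time is exponential only in the number of automata, in line with the fixed-parameter bound stated in the proof.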

3.2 i/o-Deterministic FST Intersection

Lemma 9. BDFAI$_R$ $\leq_m$ FST-I.

Proof. Given an instance $(A, \Sigma, k)$ of BDFAI$_R$, construct the following instance
$(A', \Sigma_i', \Sigma_o', u')$ of FST-I: Let $\Sigma_i' = \{\Delta\}$ for some symbol $\Delta \notin \Sigma$, $\Sigma_o' = \Sigma$ and
$u' = \Delta^k$. Given a DFA $a = (Q, \Sigma, \delta, s, F)$, let $FST_u(a) = (Q, \Sigma_i', \Sigma, \delta_F, s, F)$
be the FST such that $\delta_F = \{(q, \Delta, x, q') \mid (q,x,q') \in \delta\}$ and let $A'$ be the set
consisting of all FST $FST_u(a)$ corresponding to DFA $a \in A$. Note that in the
constructed instance of FST-I, $|A'| = |A|$, $|u'| = k$, $|Q'| = |Q|$, $|\Sigma_i'| = 1$, and
$|\Sigma_o'| = |\Sigma|$. □

Theorem 10.

1. FST-I is NP-hard when $|\Sigma_i| = 1$ and $|\Sigma_o| = 2$ and when $|Q| = 4$ and
$|\Sigma_o| = 3$.

2. $\langle |u|, |\Sigma_o| \rangle$-FST-I, $\langle |A|, |Q| \rangle$-FST-I, and $\langle |Q|, |\Sigma_i|, |\Sigma_o| \rangle$-FST-I are in FPT.

3. $\langle |A|, |u|, |\Sigma_i|_1 \rangle$-FST-I is W[1]-hard.

4. $\langle |u|, |Q|, |\Sigma_i|_1 \rangle$-FST-I is W[2]-hard.

5. $\langle |A|, |\Sigma_i|_1, |\Sigma_o|_2 \rangle$-FST-I is W[t]-hard for all $t \geq 1$.

6. $\langle |\Sigma_i|, |\Sigma_o| \rangle$-FST-I and $\langle |Q|, |\Sigma_o| \rangle$-FST-I are not in XP unless P = NP.

Proof. (Sketch) Almost all results follow by arguments similar to those in
Theorem 8 relative to the reduction in Lemma 9 and the appropriate results
in Theorem 8. The NP-hardness of FST-I when $|Q| = 4$ and $|\Sigma_o| = 3$ follows
from the reduction in 0 Section 5.5.3]. □
3.3 i/o-Deterministic FST Composition

The following reduction formalizes the observation (made independently by
Karttunen |^) that FSA intersection can be simulated by the composition of
identity-relation FST.

Lemma 11. BDFAI$_R$ $\leq_m$ FST-C.

Proof. Given an instance $(A, \Sigma, k)$ of BDFAI$_R$, construct the following instance
$(A', O', \Sigma', u')$ of FST-C: Let $\Sigma' = \Sigma \cup \{\Delta\}$ for some symbol $\Delta \notin \Sigma$, and
$u' = \Delta^k$. Given a DFA $a = (Q, \Sigma, \delta, s, F)$, let $FST_u(a) = (Q, \Sigma, \Sigma, \delta_F, s, F)$
be the FST such that $\delta_F = \{(q,x,x,q') \mid (q,x,q') \in \delta\}$. Let $A'$ be the set
consisting of all FST $FST_u(a)$ corresponding to DFA $a \in A$ plus the special
FST $FST_{init} = (\{q_1\}, \{\Delta\}, \Sigma, \delta, q_1, \{q_1\})$ for which $\delta = \{(q_1, \Delta, x, q_1) \mid x \in \Sigma\}$,
and let $O'$ be an ordering on $A'$ such that $FST_{init}$ is the first FST in $O'$ and the
other FST appear in an arbitrary order. Note that in the constructed instance of
FST-C, $|A'| = |A| + 1$, $|u'| = k$, $|Q'| = \max(|Q|, 1) = |Q|$, and $|\Sigma'| = |\Sigma| + 1$. □
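
The reduction in Lemma 11 can be illustrated by a small Python sketch that composes FST by pushing an explicit set of intermediate strings through each transducer in order. This is only an illustration under assumptions: an FST is a triple `(transitions, start, finals)` with transitions stored as `(q, x, y, q2)` tuples, the padding symbol `#` plays the role of the fresh symbol in the proof, and all function names are hypothetical.

```python
def apply_fst(fst, inputs):
    """Return every output string the FST associates with some string in `inputs`."""
    transitions, start, finals = fst
    outputs = set()
    for word in inputs:
        configs = {(start, "")}  # (current state, output produced so far)
        for symbol in word:
            configs = {
                (q2, out + y)
                for (q, out) in configs
                for (p, x, y, q2) in transitions
                if p == q and x == symbol
            }
        outputs |= {out for (q, out) in configs if q in finals}
    return outputs


def identity_fst(dfa):
    """FST that echoes exactly the strings accepted by the DFA."""
    delta, start, finals = dfa
    return ({(q, x, x, q2) for (q, x), q2 in delta.items()}, start, finals)


def init_fst(alphabet, pad="#"):
    """One-state FST that rewrites each pad symbol into an arbitrary alphabet symbol."""
    return ({(0, pad, x, 0) for x in alphabet}, 0, {0})


def intersect_by_composition(dfas, alphabet, k, pad="#"):
    strings = {pad * k}
    for fst in [init_fst(alphabet, pad)] + [identity_fst(d) for d in dfas]:
        strings = apply_fst(fst, strings)
    return strings


even_a = ({(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ends_b = ({(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})
print(sorted(intersect_by_composition([even_a, ends_b], "ab", 3)))
```

The intermediate set can hold up to $|\Sigma|^{|u|}$ strings, which is the quantity that an exhaustive string-generation algorithm for FST-C has to track.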

Theorem 12.

1. FST-C is NP-hard when $|\Sigma| = 3$.

2. $\langle |u|, |\Sigma| \rangle$-FST-C and $\langle |A|, |Q| \rangle$-FST-C are in FPT.

3. $\langle |A|, |u| \rangle$-FST-C is W[1]-hard.

4. $\langle |u|, |Q| \rangle$-FST-C is W[2]-hard.

5. $\langle |A|, |\Sigma|_3 \rangle$-FST-C is W[t]-hard for all $t \geq 1$.

6. $\langle |\Sigma| \rangle$-FST-C is not in XP unless P = NP.

Proof. (Sketch) Almost all results follow by arguments similar to those in
Theorem 8 relative to the reduction in Lemma 11 and the appropriate results
in Theorem 8. The fixed-parameter tractability of $\langle |u|, |\Sigma| \rangle$-FST-C follows by
a variant of an algorithm in jEl Theorem 4.3.3, Part (3)] that uses an $|\Sigma|^{|u|}$-length
bit-vector to store the intermediate sets of strings produced during the
FST composition. □



4 Discussion 

All parameterized complexity results for problems BDFAI, FST-I, and FST-C
that are either stated or implicit in the previous section are shown in Tables 1
and 2. Recall that results for problems FST-I and FST-C are stated relative to
restricted FST; hence, only hardness results necessarily hold for these problems
relative to general FST, and given FPT algorithms are at best outlines for
possible FPT algorithms for general FST (see [El Sections 4.3.3 and 4.4.3]
for further discussion). Future research should both look for algorithms that
exploit the sets of aspects underlying the state Cartesian product ($|A|$, $|Q|$) and
exhaustive string generation ($|u|$, $|\Sigma|$) constructions in new ways and consider
other aspects of finite-state automaton intersection and composition problems,
such as characterizations of logic formulas describing the automata m



The Parameterized Complexity of Intersection and Composition Operations 309 



Table 1. The Parameterized Complexity of the Bounded DFA Intersection
and i/o-Deterministic FST Composition Problems. (a) The Bounded DFA
Intersection Problem. (b) The i/o-Deterministic FST Composition Problem.

(a)
                        Alphabet Size $|\Sigma|$
Parameter               Unbounded    Parameter
-                       NP-hard      $\notin$ XP unless P = NP
$|A|$                   W[t]-hard    W[t]-hard
$k$                     W[2]-hard    FPT
$|Q|$                   W[2]-hard    ???
$|A|$, $k$              W[1]-hard    FPT
$|A|$, $|Q|$            ???          ???
$k$, $|Q|$              W[2]-hard    FPT
$|A|$, $k$, $|Q|$       FPT          FPT

(b)
                        Alphabet Size $|\Sigma|$
Parameter               Unbounded    Parameter
-                       NP-hard      $\notin$ XP unless P = NP
$|A|$                   W[t]-hard    W[t]-hard
$|u|$                   W[2]-hard    FPT
$|Q|$                   W[2]-hard    ???
$|A|$, $|u|$            W[1]-hard    FPT
$|A|$, $|Q|$            FPT          FPT
$|u|$, $|Q|$            W[2]-hard    FPT
$|A|$, $|u|$, $|Q|$     FPT          FPT


Table 2. The Parameterized Complexity of the i/o-Deterministic FST
Intersection Problem. (Unb = unbounded, Prm = parameter.)

                        Alphabet Sizes ($|\Sigma_i|$, $|\Sigma_o|$)
Parameter               (Unb,Unb)    (Unb,Prm)                   (Prm,Unb)    (Prm,Prm)
-                       NP-hard      $\notin$ XP unless P = NP   $\notin$ XP unless P = NP   $\notin$ XP unless P = NP
$|A|$                   W[t]-hard    W[t]-hard                   W[t]-hard    W[t]-hard
$|u|$                   W[2]-hard    FPT                         W[2]-hard    FPT
$|Q|$                   W[2]-hard    $\notin$ XP unless P = NP   W[2]-hard    FPT
$|A|$, $|u|$            W[1]-hard    FPT                         W[1]-hard    FPT
$|A|$, $|Q|$            FPT          FPT                         FPT          FPT
$|u|$, $|Q|$            W[2]-hard    FPT                         W[2]-hard    FPT
$|A|$, $|u|$, $|Q|$     FPT          FPT                         FPT          FPT


One such aspect of great interest is FST ambiguity (essentially, the maximum
number of strings associated with any input string by a FST). Problems FST-I
and FST-C are solvable in low-order polynomial time when they are restricted
to operate on sequential FST, i.e., FST that associate at most one output string
with any input string. The reduction in Lemma 11 suggests that the presence
of only one i/o-deterministic FST can make FST composition NP-hard. What
about more restricted classes of FST? One candidate is the p-subsequential
FST 1^3) which are essentially sequential FST which are allowed to append
any one of a fixed set of p strings to their output. Such FST seem adequate for
efficiently representing the ambiguity present in many applications CH . However,
the following result suggests that this observed efficiency is not universal.

Theorem 13. 2-Subsequential FST composition is NP-hard.

Proof. (Sketch) Given instances of BDFAI$_R$ created in Lemma 7, the reduction in
Lemma 11 can be modified by setting u' = A and replacing the i/o-deterministic
FST $FST_{init}$ with a set of $|u|$ 2-sequential FST that echo their input and append
a 0 or a 1 (that is, a set of 2-subsequential FST whose composition generates all
possible strings of length $|u|$ over the alphabet {0, 1}). □



Acknowledgments. The author would like to thank the CIAA referees for 
various helpful suggestions, and for pointing out several errors in the original 
manuscript as well as solutions for some of these errors. 
Abstract. In this paper, we combine (and refine) two of Brzozowski’s
algorithms, yielding a single algorithm that constructs a minimal
deterministic finite automaton (DFA) from a regular expression.



1 Introduction 



To obtain a minimal DFA, an implementor usually codes separate construction 
and minimization algorithms, composing them (at run-time) to yield the minimal
DFA. A single, combined algorithm that both constructs the DFA and
simultaneously minimizes it is likely to be more efficient, because
fewer intermediate data structures are built. Recent examples of this
phenomenon can be found in j,SI4l8IHIl 1 1I ,*f| which all present algorithms that 
construct automata while maintaining minimality or near minimality. Those al- 
gorithms have been shown to out-perform the naive run-time composition of a 
construction algorithm with a minimization algorithm. 

In this paper, the two algorithm design foci are:



1. The real-life performance of the algorithm. The asymptotic running time is 
usually a complex function of the inherent complexity of the input regular 
expression, which we do not discuss further here. 

2. The quality of the output automaton — in this case, the result is a minimal 
automaton. 



1.1 Related Work 

There are numerous possible minimizing DFA construction algorithms, at least 
one for each possible combination of a construction algorithm with a minimiza- 
tion algorithm. (Using jlH] as a basis, this yields at least two hundred possibil- 
ities.) The algorithm presented here is a relatively elegant combination of two 
algorithms which themselves are simple and easily presented. The resulting al- 
gorithm is easy to manipulate and refine, and therefore easy to optimize and 
implement. Preliminary benchmarking indicates that the new algorithm out- 
performs a runtime composition of the two component algorithms. 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. .Sll- tHTl 2001. 

© Springer- Verlag Berlin Heidelberg 2001 
1.2 Preliminaries

We assume that the reader is reasonably familiar with elementary finite au- 
tomata theory, definitions and principles, including: states, transitions, deter- 
minism, minimality, regular expressions, language denoted by a regular expres- 
sion, derivatives (of a regular expression), and similarity of regular expressions. 
(In the remainder of this paper, all derivatives are actually a short-hand for their 
similarity equivalence classes.) Readers who are not familiar with this material 
should consult any one of the standard text-books, for example |7in) . 

One definition which is not always presented in text-books is that of reversal.
We can define a function^ reverse, which reverses:

— a string, returning a string with the letters of the original string in the reverse 
order; 

— a language, by returning the language consisting of the reversal of each string 
in the original language; 

— a regular expression; this can be defined inductively on the structure of 
regular expressions; and 

— a finite automaton, by returning a new automaton in which the direction 
(source and destination) of each transition is reversed, start states (in the 
original automaton) are made into final states, and final states (in the original 
automaton) are made into start states. 
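
The first three cases of reverse can be sketched in Python as follows. The representation is an assumption made only for illustration: regular expressions are nested tuples tagged `"empty"`, `"eps"`, `"sym"`, `"cat"`, `"alt"` and `"star"`, and the function names are hypothetical.

```python
def reverse_string(word):
    """Letters of the original string in reverse order."""
    return word[::-1]


def reverse_language(language):
    """Reversal of each string in a (finite) language."""
    return {reverse_string(w) for w in language}


def reverse_regex(r):
    """Inductive reversal on the structure of a regular expression."""
    tag = r[0]
    if tag in ("empty", "eps", "sym"):
        return r
    if tag == "cat":  # reverse(E F) = reverse(F) reverse(E)
        return ("cat", reverse_regex(r[2]), reverse_regex(r[1]))
    if tag == "alt":
        return ("alt", reverse_regex(r[1]), reverse_regex(r[2]))
    if tag == "star":
        return ("star", reverse_regex(r[1]))
    raise ValueError(f"unknown regular expression tag: {tag}")


ab_star = ("cat", ("sym", "a"), ("star", ("sym", "b")))
print(reverse_regex(ab_star))
```

Only concatenation changes shape under reversal; the other operators simply recurse into their operands.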

We define subset to be the subset construction — a function which takes a 
finite automaton and returns a DFA with no unreachable states, accepting the 
same language (see, for example, P). 

All of the algorithms presented here are in the guarded command language 
(with some additional annotations for comments), a type of pseudo-code — 
see p. 

2 The Component Algorithms 

We begin with the construction portion of such an algorithm. From m Chap- 
ter 6], there are at least twenty known construction algorithms; unfortunately, 
the performance data presented in pa Chapter 14] does not include all of those 
algorithms. Nonetheless, anecdotal experience (also from industrial applications
such as in computational linguistics) indicates that a good choice is the
derivatives-based algorithm by Brzozowski Pj (we refer to this algorithm as Brzconstr).
That algorithm, which yields a DFA, is exceptionally easy to present and to
implement, satisfying the first of our two goals. Furthermore, it can be efficiently
implemented^ by providing a regular expression simplifier, since each state (in

^ Despite being called a function, it would likely be implemented as an imperative 
program. 

^ Extensive benchmarking data for this is not available; however, industry applications
of the algorithm have shown that it is more efficient than other popular algorithms
such as the Aho-Sethi-Ullman algorithm llOj.
the resulting automaton) is represented by a regular expression. Such a simplifier
is then able to identify two states which may otherwise have been distinguished;
for example, the states a·b and a·ε·b would be identified if the simplifier is aware
of the identity E·ε = E. (In order for Brzozowski’s construction to terminate
(correctly), the simplifier must at least recognize similarity; we do not discuss
this further in this paper.)

We now focus on the selection of a minimization algorithm. There are sev- 
eral known minimization algorithms — see PH Chapter 7]. Presently, Hopcroft’s 
algorithm is the algorithm with the best-known asymptotic running time (of
O(n log n), for n the number of states). Unfortunately, that algorithm is also one
of the more difficult algorithms to present, manipulate, refine, and implement — 
cf. that David Gries’s paper |0| shed significant light on the algorithm’s deriva- 
tion. As we discuss in Brzozowski’s minimization algorithm fP has also 
proven to be exceptionally good in practice, usually out-performing Hopcroft’s 
algorithm. (For a performance comparison, see [IDl Chapter 15].) Without pre- 
senting the details (which can be found in [I Djb Brzozowski’s minimization al- 
gorithm is defined as follows (for automaton M, where the semicolon is used for 
sequential composition of the functions, and we may choose to implement the 
‘functions’ as imperative programs): 

Brzmin(M) = (reverse; subset; reverse; subset) (M) 
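
The composition above can be written out as a minimal Python sketch. The representation is an assumption for illustration only: an automaton is a tuple `(states, alphabet, delta, starts, finals)` with `delta` mapping `(state, symbol)` to a set of target states, and the names `reverse`, `subset` and `brzmin` simply mirror the functions named in the text.

```python
def reverse(nfa):
    """Flip every transition and exchange the start and final state sets."""
    states, alphabet, delta, starts, finals = nfa
    rdelta = {}
    for (q, a), targets in delta.items():
        for t in targets:
            rdelta.setdefault((t, a), set()).add(q)
    return states, alphabet, rdelta, set(finals), set(starts)


def subset(nfa):
    """Subset construction yielding a complete DFA with only reachable states."""
    _, alphabet, delta, starts, finals = nfa
    start = frozenset(starts)
    dstates, ddelta, work = {start}, {}, [start]
    while work:
        s = work.pop()
        for a in alphabet:
            t = frozenset(x for q in s for x in delta.get((q, a), ()))
            ddelta[(s, a)] = {t}
            if t not in dstates:
                dstates.add(t)
                work.append(t)
    dfinals = {s for s in dstates if s & finals}
    return dstates, alphabet, ddelta, {start}, dfinals


def brzmin(m):
    """Brzmin(M) = (reverse; subset; reverse; subset)(M)."""
    return subset(reverse(subset(reverse(m))))


# A three-state DFA for "strings over {a, b} ending in a"; states 1 and 2 are equivalent.
delta = {(0, "a"): {1}, (0, "b"): {0}, (1, "a"): {2},
         (1, "b"): {0}, (2, "a"): {2}, (2, "b"): {0}}
dfa = ({0, 1, 2}, "ab", delta, {0}, {1, 2})
print(len(brzmin(dfa)[0]))
```

The subset states of the second pass are sets of sets, which Python handles directly because `frozenset` values are hashable.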

3 Combining Algorithms 

In this section, we combine the two algorithms by Brzozowski to yield a more 
efficient single algorithm than their run-time composition. Initially, we consider 
expression: 

(Brzconstr; Brzmin)(E)

Expanding the definition of Brzmin, we get 

(Brzconstr; reverse; subset; reverse; subset)(E)

Straight-forward refinements and improvements of this algorithm can be ob- 
tained by combining algorithm components which are adjacent in terms of com- 
position, and for the moment we ignore the right-most component (the subset). 

We can make an important observation at this point: reversal commutes with 
all construction algorithms (mapping a regular expression to a finite automaton) . 
This allows us to switch the first two components of the composition: 

(reverse; Brzconstr; subset; reverse; subset)(E)

The latter half of Brzmin (reverse; subset) only requires that its input is a DFA 
accepting the language denoted by reverse(E). Since Brzconstr already yields a 
DFA, the invocation of subset following Brzconstr is redundant, and we simplify 
it to 



(reverse; Brzconstr; reverse; subset)(E)



314 



B.W. Watson 



To make further improvements, we switch to the imperative version of the
algorithm. (Q refers to the set of states, S, F ⊆ Q are the sets (respectively) of
start and final states, δ is the transition relation, and ℒ is an overloaded function
giving the language of a regular expression or finite automaton.)

Algorithm 3.1:

E' := reverse(E);
Q, δ, S, F := ∅, ∅, {E'}, ∅;
done, to-do := ∅, {E'};
do to-do ≠ ∅ →
    let p be some state such that p ∈ to-do;
    done, to-do := done ∪ {p}, to-do \ {p};
    destination := a⁻¹p;  { the left derivative of p by a }
    if destination ∉ done →
        { destination's out-transitions are still to be built }
        to-do := to-do ∪ {destination}
    [] destination ∈ done → skip
    fi;
    Q := Q ∪ {destination};
    { make a transition from p to destination on a }
    δ(p, a) := destination;
    if ε ∈ ℒ(destination) →
        { this should be a final state }
        F := F ∪ {destination}
    [] ε ∉ ℒ(destination) → skip
    fi
od;
{ ℒ(Q, δ, S, F) = ℒ(reverse(E)) }
(Q, δ, S, F) := subset(reverse(Q, δ, S, F));
{ ℒ(Q, δ, S, F) = ℒ(E) }



□ 

In the above algorithm, we begin by reversing the regular expression E (giving 
E'). Thanks to the symmetry of derivatives, we have an alternative: use right 
derivatives instead of left derivatives. This yields the following algorithm (in 
which we have removed E' and the changes from the previous algorithm have 
been underlined): 

Algorithm 3.2:

Q, δ, S, F := ∅, ∅, {E}, ∅;
done, to-do := ∅, {E};
do to-do ≠ ∅ →
    let p be some state such that p ∈ to-do;
    done, to-do := done ∪ {p}, to-do \ {p};
    destination := pa⁻¹;  { the right derivative of p by a }
    if destination ∉ done →
        { destination's out-transitions are still to be built }
        to-do := to-do ∪ {destination}
    [] destination ∈ done → skip
    fi;
    Q := Q ∪ {destination};
    { make a transition from p to destination on a }
    δ(p, a) := destination;
    if ε ∈ ℒ(destination) →
        { this should be a final state }
        F := F ∪ {destination}
    [] ε ∉ ℒ(destination) → skip
    fi
od;
{ ℒ(Q, δ, S, F) = ℒ(reverse(E)) }
(Q, δ, S, F) := subset(reverse(Q, δ, S, F));
{ ℒ(Q, δ, S, F) = ℒ(E) }



□ 

Our remaining goal is to combine the final reversal (of the DFA) with the
preceding portion of the algorithm. Finite automaton reversal is as simple as exchanging
the places of the start and final states (S and F), and reversing the update of
the transition function (δ), yielding:

Algorithm 3.3:

Q, δ, S, F := ∅, ∅, {E}, ∅;
done, to-do := ∅, {E};
do to-do ≠ ∅ →
    let p be some state such that p ∈ to-do;
    done, to-do := done ∪ {p}, to-do \ {p};
    destination := pa⁻¹;  { the right derivative of p by a }
    if destination ∉ done →
        { destination's out-transitions are still to be built }
        to-do := to-do ∪ {destination}
    [] destination ∈ done → skip
    fi;
    Q := Q ∪ {destination};
    { make a transition from destination to p on a }
    δ(destination, a) := p;
    if ε ∈ ℒ(destination) →
        { this should be a final state }
        F := F ∪ {destination}
    [] ε ∉ ℒ(destination) → skip
    fi
od;
{ ℒ(Q, δ, S, F) = ℒ(reverse(E)) }
(Q, δ, S, F) := subset(Q, δ, F, S);
{ ℒ(Q, δ, S, F) = ℒ(E) }



□ 

This is our final algorithm. We will not manipulate the remaining composition 
further (since it does not appear to yield any performance or readability advan- 
tages), nor will we explicitly present subset. 
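
The final algorithm can be approximated by the following Python sketch. It is an interpretation under several assumptions rather than a transcription: regular expressions are nested tuples built by the hypothetical smart constructors `alt`, `cat` and `star` (which act as a small simplifier recognising similarity), the right-derivative rules in `rderiv` are the standard mirror images of Brzozowski's left-derivative rules, the loop over every alphabet symbol `a` is made explicit, and the subset construction is inlined.

```python
from functools import reduce

EMPTY, EPS = ("empty",), ("eps",)

def sym(c):
    return ("sym", c)

def alt(r, s):
    """Union kept as a flat set, i.e. modulo associativity, commutativity, idempotence."""
    members = set()
    for t in (r, s):
        members |= t[1] if t[0] == "alt" else {t}
    members.discard(EMPTY)
    if not members:
        return EMPTY
    return next(iter(members)) if len(members) == 1 else ("alt", frozenset(members))

def cat(r, s):
    if EMPTY in (r, s):
        return EMPTY
    if r == EPS:
        return s
    return r if s == EPS else ("cat", r, s)

def star(r):
    if r in (EMPTY, EPS):
        return EPS
    return r if r[0] == "star" else ("star", r)

def nullable(r):
    tag = r[0]
    if tag in ("eps", "star"):
        return True
    if tag in ("empty", "sym"):
        return False
    if tag == "cat":
        return nullable(r[1]) and nullable(r[2])
    return any(nullable(t) for t in r[1])  # alt

def rderiv(r, a):
    """Right derivative r a^-1 = { x : x a in L(r) }."""
    tag = r[0]
    if tag in ("empty", "eps"):
        return EMPTY
    if tag == "sym":
        return EPS if r[1] == a else EMPTY
    if tag == "alt":
        return reduce(alt, (rderiv(t, a) for t in r[1]), EMPTY)
    if tag == "star":
        return cat(r, rderiv(r[1], a))
    left, right = r[1], r[2]  # cat
    d = cat(left, rderiv(right, a))
    return alt(d, rderiv(left, a)) if nullable(right) else d

def minimal_dfa(expr, alphabet):
    rev = {}  # reversed transitions: (destination, a) -> set of sources p
    seen, todo = {expr}, [expr]
    while todo:
        p = todo.pop()
        for a in alphabet:
            dest = rderiv(p, a)
            rev.setdefault((dest, a), set()).add(p)
            if dest not in seen:
                seen.add(dest)
                todo.append(dest)
    start = frozenset(q for q in seen if nullable(q))  # old final states
    states, delta, work = {start}, {}, [start]
    while work:  # subset construction on the reversed automaton
        s = work.pop()
        for a in alphabet:
            t = frozenset(p for q in s for p in rev.get((q, a), ()))
            delta[(s, a)] = t
            if t not in states:
                states.add(t)
                work.append(t)
    finals = {s for s in states if expr in s}  # old start state
    return states, delta, start, finals

def accepts(dfa, word):
    _, delta, state, finals = dfa
    for c in word:
        state = delta[(state, c)]
    return state in finals

a, b = sym("a"), sym("b")
dfa = minimal_dfa(cat(cat(star(alt(a, b)), a), b), "ab")  # (a|b)*ab
print(len(dfa[0]), accepts(dfa, "aab"), accepts(dfa, "aba"))
```

The resulting DFA is complete, so a sink state (the empty subset) appears whenever the language requires one.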

4 Closing Comments 

The derived algorithm has the following characteristics: 

— The derivation of the algorithm is easily understood and the correctness is 
easily established. 

— Thanks to the easily-understood derivation, the algorithm is also quickly 
implemented. 

— The final algorithm is as easy to implement as the original Brzconstr. The 
differences can be factored, leaving a common algorithmic skeleton. (This 
technique is used in m for keyword pattern matching algorithms with a 
common skeleton.) 

— One of the algorithmic components, the subset construction, is usually al- 
ready implemented in automata toolkits. 

— Like Brzozowski’s derivatives-based construction algorithm, the performance 
is easily improved by implementing an improved regular expression simpli- 
fier. 

— The component algorithms have excellent performance in practice. It follows 
that the new algorithm will display similar (if not better) performance — 
though this is still being verified. 

Acknowledgements. I would like to thank Nanette Saes and the anonymous 
referees for improving the quality of this paper. 
Abstract. There are known mechanisms to succinctly describe regular
languages, such as nondeterministic finite automata, boolean automata, 
and statecharts. The MERLin project is an investigation into and com- 
parison of different description mechanisms for the regular languages. In 
particular, we are concerned with descriptions which, for a specific appli- 
cation domain, often achieve succinctness. To this end we implemented a 
Modelling Environment for Regular Languages (MERLin). This paper 
describes the application of the MERLin system to analyze the behaviour 
of selective nondeterministic finite automata. 



1 Introduction 

There are known mechanisms to succinctly describe regular languages, such as 
nondeterministic finite automata (NFAs) m, boolean automata cni and state- 
charts 0. However, in practical applications, it may happen that the theoretical 
succinctness bound is seldom achieved. We are interested in the ‘average
succinctness behaviour’ of description mechanisms for the regular languages, that
is, the relative frequency with which a succinct description is obtained using
a given description mechanism. Are some mechanisms better than others
in this regard? For example, if the number of states is the criterion, would it
be better to use NFAs or statecharts as the description mechanism in a certain
application domain?

The theoretical analysis of new description mechanisms is often complex and 
time-consuming; we needed a practical experiment environment to give a rough 
indication of the behaviour of new description mechanisms before a theoretical 
analysis is undertaken. To this end we implemented a Modelling Environment 
for Regular Languages (MERLin). 

This paper describes the application of the MERLin system to selective
nondeterministic finite automata (*-NFAs) [14,15]. In Sect. 2 we define *-NFAs. In Sect. 3 we
give a short overview of the MERLin system. We discuss the results obtained
from the analysis of certain *-NFAs in MERLin in Sect. 4 and show how this
leads to the use of these *-NFAs as random number generators in MERLin.



* This research was supported by grants from the University of Stellenbosch. 

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 318- t?^ 2001. 
2 Definition of *-NFAs

Van der Walt and Van Zijl introduced *-NFAs in A detailed analysis of the 
succinctness achievable by these machines over the regular languages was given 
in usi. We recap the main definitions on *-NFAs here. 

Definition 1. A *-NFA $M$ is a 6-tuple $M = (Q, \Sigma, \delta, q_0, F, \star)$, where $Q$ is the
finite non-empty set of states, $\Sigma$ is the finite non-empty input alphabet, $q_0 \in Q$
is the start state and $F \subseteq Q$ is the set of final states. $\delta$ is the transition function
such that $\delta : Q \times \Sigma \to 2^Q$, and $\star$ is any associative commutative binary operation
on sets.

The transition function $\delta$ can be extended to $\delta : 2^Q \times \Sigma \to 2^Q$ by defining

$\delta(A, a) = \bigstar_{q \in A} \delta(q, a)$    (1)

for any $a \in \Sigma$ and $A \in 2^Q$.

$\delta$ can also be extended to $\delta : 2^Q \times \Sigma^* \to 2^Q$ in the usual way.

The *-NFA accepts a word $w$ if $\delta(q_0, w)$ contains at least one final state
$q_f \in F$.

Theorem 1. Let $\mathcal{L}(M)$ be a language accepted by a *-NFA $M$. Then there exists
a deterministic finite automaton (DFA) $M'$ that accepts $\mathcal{L}(M)$.

Proof. By the well-known subset construction ^ , but use Equation (1) to calculate
the transition table of the DFA. See uni for more details. □



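Example 1. Let $M$ be a *-NFA defined by

$$M = (\{q_1, q_2, q_3\}, \{a\}, \delta, q_1, \{q_3\}, \star)$$

with $\delta$ given by

$$\begin{array}{c|c}
\delta & a \\ \hline
q_1 & \{q_1, q_2\} \\
q_2 & \{q_2, q_3\} \\
q_3 & \{q_3\}
\end{array}$$

Choose $\star$ to be union, so that $M$ is a traditional NFA. Use the subset
construction to find the DFA $M' = (Q', \{a\}, \delta', [q_1], F')$ equivalent to
$M$. Here $\delta'$ is given by:

$$\begin{array}{c|c}
\delta' & a \\ \hline
[q_1] & [q_1, q_2] \\
[q_1, q_2] & [q_1, q_2, q_3] \\
[q_1, q_2, q_3] & [q_1, q_2, q_3]
\end{array}$$
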
If, on the other hand, $M$ were a ⊕-NFA, the subset construction must be
applied using symmetric difference instead of union, and then the transition
function for its equivalent DFA $M''$ is given by:

$$\begin{array}{c|c}
\delta'' & a \\ \hline
[q_1] & [q_1, q_2] \\
[q_1, q_2] & [q_1, q_3] \\
[q_1, q_3] & [q_1, q_2, q_3] \\
[q_1, q_2, q_3] & [q_1]
\end{array}$$

□
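
The two constructions of Example 1 can be reproduced with a few lines of code. The following is a minimal sketch, not MERLin or Grail code: the function name `unary_subset_construction` and the dictionary encoding of the transition table are assumptions made for illustration. It applies Equation (1) repeatedly from the start subset, with the set operation passed in as a parameter.

```python
import operator
from functools import reduce


def unary_subset_construction(delta, start, op):
    """Follow Equation (1) from {start} until a subset repeats.

    delta maps each state q to the frozenset delta(q, a) of a unary *-NFA;
    op is the binary set operation chosen for * (hypothetical encoding).
    Returns a dict mapping each reachable subset to its successor subset.
    """
    table = {}
    current = frozenset({start})
    while current not in table:
        images = [delta[q] for q in sorted(current)]
        # Combine the images of all member states with the chosen operation.
        successor = reduce(op, images) if images else frozenset()
        table[current] = successor
        current = successor
    return table


# Transition table of Example 1, with q1, q2, q3 written as 1, 2, 3.
delta = {1: frozenset({1, 2}), 2: frozenset({2, 3}), 3: frozenset({3})}

union_dfa = unary_subset_construction(delta, 1, operator.or_)
xor_dfa = unary_subset_construction(delta, 1, operator.xor)

for name, dfa in (("union", union_dfa), ("symmetric difference", xor_dfa)):
    print(name)
    for source, target in dfa.items():
        print("  ", sorted(source), "->", sorted(target))
```

Passing the operation as an argument mirrors the definition: only the way images are combined changes between a ∪-NFA and a ⊕-NFA, while the construction itself stays the same.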



3 The MERLin System 

The MERLin system provides an environment to conduct experiments with au- 
tomata over the regular languages. It consists of a graphical user interface front- 
end, the experiment environment itself, and an automata manipulation engine 
as a back-end. 

We adapted existing software to plug in as front-end and back-end. We
currently use Grail as our back-end. We ported Grail to Linux, added a *-NFA
class, and run one Grail engine on each node of a Beowulf cluster. This
allows for parallel processing power in the case of large experiments. MERLin is
not dependent on a Beowulf; the user may activate the use of the cluster via
menu options if a Beowulf is available. Note that the parallelization is simply a
workload distribution over various nodes in the cluster.

The graphical front-end was built using LEDA and the LEDA libraries for
graphs and graph manipulation. The graphical user interface contains standard
features such as a finite automata editor, layout algorithms for automata, file
storage of automata, and menu options to set up experiments on finite automata.

The MERLin experiment environment is based on finite machines as objects 
(as in Grail). It allows the user to create one or more finite machines by either 
(a) reading a pre-defined automaton from a file, or (b) creating an automaton 
interactively with the automata editor, or (c) using the built-in option to ran- 
domly create a number of automata. The resultant automata are stored in files. 
An experiment is then set up as a specification (in a menu system) of a series
of Grail function calls to be applied to those files. The results are again stored
in files which can be accessed through the user interface. The experiment en- 
vironment also acts as a control shell, for example in controlling the workload 
distribution of an experiment if the Beowulf option applies. 

4 The MERLin Experiment 

We set up a MERLin experiment to compare the ‘average succinctness behaviour’
of different *-NFAs. The results lead to a focused theoretical analysis of,
specifically, the unary ⊕-NFAs. Here ⊕ is used in its usual set-theoretic sense as
$A \oplus B = (A \cup B) \setminus (A \cap B)$. A unary *-NFA is a *-NFA with one
alphabet symbol only.

We used MERLin to randomly generate a large number of unary n-state *-NFAs,
for * taken as (respectively) union, intersection and symmetric difference.
The same set of *-NFAs was generated in each case by fixing the seed of the
random number generator. The final state set was restricted to one state only
(state n − 1), and the start state fixed as state 0. For each set of random *-NFAs,
we



— converted the *-NFAs to DFAs, 

— minimized the DFAs, 

— and retained only unique regular languages. 

We analysed the results, and Fig. 1 illustrates typical results, in this case for
n = 5 states. The figure shows the number of states of the minimal DFAs on the
x-axis versus the number of different regular languages accepted by DFAs with
this state count on the y-axis. It is noticeable that

— The ⊕-NFAs reach the bound $2^n - 1$, whereas the ∪-NFAs and ∩-NFAs do
not (it is known that unary ∪-NFAs have the bound $e^{\Theta(\sqrt{n \ln n})}$).

— There are (relatively) many languages that can be represented succinctly
with the ⊕-NFAs.

— The graph of the ⊕-NFAs has many ‘gaps’, where there are no DFAs with a
certain number of states (e.g. 21 to 30).

The ∩-NFAs behave similarly to ∪-NFAs, but the ⊕-NFAs show rather
remarkable results. We repeated the experiments with different values of n, with
different random number generators and different seeds, and with different start
and final state sets. The overall results stayed the same. Further theoretical
analysis of the unary ⊕-NFAs was undertaken.

We define the state cycle of a unary ⊕-NFA M as the cycle in its equivalent
unary DFA M'.¹ The length of the state cycle of M is the length of the cycle
of M'.

Careful scrutiny of the graph in Fig. 1 (and similar graphs for other values
of n) reveals that, at each value i which is a factor of $2^n - 1$, there is a
relatively large number of DFAs with that cycle length. This fact, together with
the ‘gaps’ mentioned above, is reminiscent of well known results from the theory
of switching circuits and the theory of finite fields.

We showed that a unary ⊕-NFA can be encoded so that its state cycle can
be seen to be that of a linear feedback shift register (LFSR) over the Galois
field GF(2).

¹ In the graphical representation of a unary DFA it is easy to see that every node has
a single successor, and the graph hence forms a sequence of nodes. The successor of
the last node in this sequence of nodes may be the last node itself, or any of the
previous nodes. The successor of the last node therefore determines a cycle in the
graph. This cycle can be of length 1 (if the last node returns to itself), or of length
k if it returns to the (k − 1)-th predecessor node of the last node.
Let $T_i$ denote the $i$-dimensional vector space of column vectors over GF(2).
A linear machine over GF(2) is a 5-tuple

$$M = (T_k, T_l, T_m, \tau, \omega),$$

where $T_k$ is the set of states, $T_l$ the set of inputs, $T_m$ the set of
outputs, and $\tau$ and $\omega$ are linear transformations such that
$\tau : T_{k+l} \to T_k$ and $\omega : T_{k+l} \to T_m$.
The next state $Y(t)$ of a linear machine at time $t$ can be described as a
function of the present state $y(t)$ and the inputs $x(t)$. Similarly, the output
$z(t)$ at time $t$ is a function of the present state $y(t)$ and the inputs
$x(t)$. In matrix notation,

$$Y = Ay + Bx \qquad (2)$$
$$z = Cy + Dx. \qquad (3)$$

$A$, $B$, $C$ and $D$ are the characterizing matrices of $M$.



[Figure: 5-state unary *-NFAs. x-axis: number of states in minimal DFA. Series:
intersection NFA, union NFA, symmetric difference NFA.]

Fig. 1. DFA states vs unique regular languages for 5-state *-NFAs
An autonomous linear machine is a linear machine with no input. That is,
$B = D = 0$, so that the matrix equations (2) and (3) become

$$y(t) = A^t y(0) \qquad (4)$$
$$z(t) = C y(t). \qquad (5)$$

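An LFSR is a linear autonomous machine in which the characterizing matrix $A$
is an $n \times n$ matrix with the special form

$$A = \begin{bmatrix}
0 & 0 & \cdots & 0 & a_0 \\
1 & 0 & \cdots & 0 & a_1 \\
0 & 1 & \cdots & 0 & a_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & 1 & a_{n-1}
\end{bmatrix}$$

and the characterizing matrix $C$ is a $1 \times n$ matrix.

The characteristic polynomial $c(X)$ of the matrix $A$ above is given by
$\det(XI - A)$. That is,

$$c(X) = X^n - a_{n-1} X^{n-1} - \cdots - a_1 X - a_0.$$

$A$ is called the companion matrix of $c(X)$.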

The successive powers of the matrix $A$ represent the states of the LFSR:

Definition 2. Let $S$ be an LFSR with characteristic matrix $A$. Then $A^k$
represent the states of the LFSR, with $k = 1, 2, \ldots, p$ for some integer
$p \geq 1$, and $p$ the maximum value for which all the $A^k$ are distinct.

Definition 3. Let $S$ be an LFSR with characteristic matrix $A$. Let $A^k$ be the
states of the LFSR, with $k = 1, 2, \ldots, p$ for some integer $p \geq 1$, and
$p$ the maximum value for which all the $A^k$ are distinct. Then
$A^{p+1} = A^i$, for some $i$ with $1 \leq i \leq p$. The sequence of states
$A^i, \ldots, A^p$ is the state cycle of the LFSR $S$.



Now encode the transition table of a unary ⊕-NFA
$M = (Q, \Sigma, \delta, q_0, F, \oplus)$ as an $n \times n$ matrix
$A = [a_{ij}]_{n \times n}$ over GF(2); for every state $q_i \in Q$, let

$$a_{ji} = \begin{cases} 1 & \text{if } q_j \in \delta(q_i, a) \\ 0 & \text{otherwise.} \end{cases}$$

It is easy to show by induction that the $i$-th column of $A^k$ represents the
states reached by $M$ from state $q_i$ after reading the input word $a^k$.
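
As an illustration of this encoding, the sketch below builds the matrix for the three-state machine of Example 1 (states renumbered 0, 1, 2) and reads the subsets of the equivalent DFA off the first column of successive powers. The helper names `encode`, `mat_mul_gf2` and `column` are hypothetical and chosen only for this example.

```python
def encode(delta, n):
    """Return A with A[j][i] = 1 iff state j is in delta(state i, a)."""
    return [[1 if j in delta[i] else 0 for i in range(n)] for j in range(n)]


def mat_mul_gf2(x, y):
    """Multiply two square 0/1 matrices over GF(2)."""
    n = len(x)
    return [
        [sum(x[r][k] & y[k][c] for k in range(n)) % 2 for c in range(n)]
        for r in range(n)
    ]


def column(matrix, i):
    """Interpret column i of a 0/1 matrix as a set of state numbers."""
    return {j for j in range(len(matrix)) if matrix[j][i]}


# Example 1 with q1, q2, q3 renumbered as 0, 1, 2.
delta = {0: {0, 1}, 1: {1, 2}, 2: {2}}
A = encode(delta, 3)
A2 = mat_mul_gf2(A, A)
A3 = mat_mul_gf2(A2, A)
A4 = mat_mul_gf2(A3, A)

for k, power in enumerate((A, A2, A3, A4), start=1):
    print(f"after a^{k} from state 0:", sorted(column(power, 0)))
```

Addition over GF(2) is exactly symmetric difference on the encoded sets, which is why matrix powers track the ⊕ subset construction.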



Theorem 2. For any $n$-state LFSR $S$, there is a unary $(n+1)$-state ⊕-NFA $M$
with the same state cycle as $S$.

Proof. Take any LFSR $S$ with next-state function $y$. Construct a unary ⊕-NFA
$M = (Q, \{a\}, \delta, q_0, F, \oplus)$. Let $Q = \{q_0, q_1, \ldots, q_n\}$ where
$q_1, \ldots, q_n$ are the states of $S$, and $q_0$ is an additional start state.
For every $y_i = 1$ in $y(0)$, let $q_i \in \delta(q_0, a)$.
For $\delta(q_i, a)$, $i > 0$, take the characterizing matrix $A$ of $S$ to
represent the encoding of the transition function of $M$. Then the $j$-th state in
the state cycle of $S$ is given by $A^j y(0)$, which is exactly the $j$-th state in
the DFA equivalent to $M$. □

A theoretical analysis of unary ⊕-NFAs, based on the theorem above, enables
one to prove various results about the unary ⊕-NFAs. For example:

Theorem 3. Unary ⊕-NFAs are exponentially more succinct than DFAs.

Proof outline. For given $n$, select any primitive polynomial $c(X)$ of degree
$n$ over GF(2). Construct the companion matrix $A$ of $c(X)$, and let $A$ be the
binary encoding of the transition table of a unary ⊕-NFA $M_n$. Then $A$ has
maximal period, and the DFA $M'$ equivalent to $M_n$ has size $2^n - 1$. □
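
The proof outline can be checked numerically for small degrees. The sketch below is an illustration only: the helper names are hypothetical, and the two polynomials $X^3 + X + 1$ and $X^4 + X + 1$ are standard examples of primitive polynomials over GF(2) that do not come from the paper. It builds the companion matrix in the form given earlier and computes the smallest $p$ with $A^p = I$.

```python
def companion(coeffs):
    """Companion matrix over GF(2) with a_0 .. a_{n-1} in the last column."""
    n = len(coeffs)
    rows = []
    for r in range(n):
        row = [0] * n
        if r > 0:
            row[r - 1] = 1          # ones on the subdiagonal
        row[n - 1] = coeffs[r] % 2  # last column holds the coefficients
        rows.append(row)
    return rows


def mat_mul_gf2(x, y):
    """Multiply two square 0/1 matrices over GF(2)."""
    n = len(x)
    return [
        [sum(x[r][k] & y[k][c] for k in range(n)) % 2 for c in range(n)]
        for r in range(n)
    ]


def period(a):
    """Smallest p >= 1 with a^p equal to the identity, or None if none exists."""
    n = len(a)
    identity = [[int(r == c) for c in range(n)] for r in range(n)]
    power = a
    for p in range(1, 2 ** n):
        if power == identity:
            return p
        power = mat_mul_gf2(power, a)
    return None


# Over GF(2), -1 = 1, so X^3 + X + 1 has (a_0, a_1, a_2) = (1, 1, 0).
print(period(companion([1, 1, 0])))
# X^4 + X + 1 has (a_0, a_1, a_2, a_3) = (1, 1, 0, 0).
print(period(companion([1, 1, 0, 0])))
```

For a primitive polynomial of degree $n$ the computed period equals $2^n - 1$, the size stated in the proof outline.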



Theorem 4. There exists a family of languages $\{\mathcal{L}_n\}_{n > 0}$ such that
$\mathcal{L}_n$ is recognized by an $n$-state ∪-NFA $M$ which, when interpreted as
a ⊕-NFA, also recognizes $\mathcal{L}_n$. Moreover, the smallest DFA recognizing
$\mathcal{L}_n$ has $O(2^n)$ states.



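Proof. Define a ∪-NFA $M_n = (\{0, \ldots, n-1\}, \{a, b, c\}, \delta, 0, F, \cup)$,
with $F = \{0\}$ and $\delta$ given by

$$\delta(i, a) = \{(i + (n-1)) \bmod n\}, \quad i = 0, 1, \ldots, n-1$$

$$\delta(i, b) = \begin{cases} \{1\} & i = 0 \\ \{0\} & i = 1 \\ \{i\} & i = 2, 3, \ldots, n-1 \end{cases}$$

$$\delta(i, c) = \begin{cases} \emptyset & i = 0 \\ \{i\} & i = 1, 2, \ldots, n-2 \\ \{0, n-1\} & i = n-1. \end{cases}$$

For any subset $A$ of $\{0, \ldots, n-1\}$ with at least two elements,
$\bigcap_{j \in A} \delta(j, \sigma) = \emptyset$ for any $\sigma \in \Sigma$; that
is, distinct states have disjoint images, so union and symmetric difference
coincide. Hence $M_n$ generates exactly the same DFA either as a ∪-NFA or as a
⊕-NFA. For the ∪-NFA case Leiss proved succinctness. Since the DFAs are identical
the result also holds for the ⊕-NFA. □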
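
The key step of this proof, that the two interpretations yield the same DFA, can be checked mechanically for small $n$. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and the transition function follows the case definitions of $\delta$ as read from the proof. It runs the subset construction once with union and once with symmetric difference and compares the resulting tables.

```python
import operator
from functools import reduce


def build_delta(n):
    """Transition function of M_n on states 0..n-1 over the alphabet a, b, c."""
    delta = {}
    for i in range(n):
        delta[(i, "a")] = frozenset({(i + n - 1) % n})
        if i == 0:
            delta[(i, "b")] = frozenset({1})
        elif i == 1:
            delta[(i, "b")] = frozenset({0})
        else:
            delta[(i, "b")] = frozenset({i})
        if i == 0:
            delta[(i, "c")] = frozenset()
        elif i == n - 1:
            delta[(i, "c")] = frozenset({0, n - 1})
        else:
            delta[(i, "c")] = frozenset({i})
    return delta


def subset_construction(delta, op, alphabet="abc"):
    """Reachable part of the subset automaton, combining images with op."""
    table, todo = {}, [frozenset({0})]
    while todo:
        subset = todo.pop()
        if subset in table:
            continue
        table[subset] = {}
        for symbol in alphabet:
            images = [delta[(q, symbol)] for q in subset]
            # The empty set is the identity for both union and symmetric difference.
            target = reduce(op, images, frozenset())
            table[subset][symbol] = target
            todo.append(target)
    return table


delta4 = build_delta(4)
as_union = subset_construction(delta4, operator.or_)
as_xor = subset_construction(delta4, operator.xor)
print(as_union == as_xor)
```

The comparison succeeds because the images of distinct states never overlap, so no element is ever cancelled by the symmetric difference.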



Theorem 5. There exists a family of languages $\{\mathcal{L}_n\}_{n > 1}$ such that
$\mathcal{L}_n$ is recognized by an $n$-state ∩-NFA $M$ which, when interpreted as
a ⊕-NFA, also recognizes $\mathcal{L}_n$.



Proof. See [E|. 



□ 
4.1 Random Generation of Finite Automata

It is known that LFSRs can be used as random number generators. The
results of the previous section allow us to use unary ⊕-NFAs as the random
number generators in MERLin.

To generate random numbers with unary ⊕-NFAs, we apply the following
six-step process:

1. Encode the generator unary ⊕-NFA $M$ into a matrix $A$ as described
previously.

2. Convert $M$ to its equivalent DFA step by step, that is, compute $A^k$, for
$k = 0, \ldots, p$, where $p$ is the cycle length of $M$.

3. For each $A^k$, compute $A^k y(0)$. Here $y(0)$ is the seed for the random
sequence, which is formed by encoding the start states for $M$ into an
$n \times 1$ column vector.

4. The sequence $y(0), Ay(0), A^2 y(0), A^3 y(0), \ldots$ calculated above is a
sequence of $n \times 1$ vectors. Take from each vector the element $y_1$; these
elements form a sequence of bits.

5. Take the bit sequence obtained above, and group it into equal-sized groups
of bits.

6. Take each group to represent the binary representation of a whole number.
This sequence of numbers forms the pseudo-random sequence.
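
The six steps above can be sketched as follows. This is an illustrative sketch, not the MERLin implementation: the generator matrix (the companion matrix of $X^4 + X + 1$), the seed and the group size of three bits are assumptions chosen for the example, and the vector $A^k y(0)$ is obtained by repeated multiplication rather than by forming $A^k$ explicitly, which gives the same sequence.

```python
def mat_vec_gf2(a, y):
    """Multiply a 0/1 matrix by a 0/1 column vector over GF(2)."""
    return [sum(r & v for r, v in zip(row, y)) % 2 for row in a]


def bit_stream(a, seed, count):
    """Steps 2-4: iterate y <- A y and collect the first element of each vector."""
    y = list(seed)
    bits = []
    for _ in range(count):
        bits.append(y[0])
        y = mat_vec_gf2(a, y)
    return bits


def group(bits, size):
    """Steps 5-6: cut the bits into groups and read each group as a binary number."""
    numbers = []
    for start in range(0, len(bits) - size + 1, size):
        value = 0
        for bit in bits[start:start + size]:
            value = (value << 1) | bit
        numbers.append(value)
    return numbers


# Step 1 (assumed generator): companion matrix of X^4 + X + 1 over GF(2).
A = [
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
]
SEED = [1, 0, 0, 0]  # assumed y(0): only the first state is a start state

bits = bit_stream(A, SEED, 15)  # one full cycle of length 2^4 - 1
numbers = group(bits, 3)
print(bits)
print(numbers)
```

A single short generator like this one repeats after 15 bits, so it only illustrates the mechanics of the process.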

By the correspondence between LFSRs and unary ⊕-NFAs, it is trivial to
show that the unary ⊕-NFAs perform as well as LFSRs as random number
generators. It was shown for LFSRs that characteristic polynomials of the
form $x^p - x^q - 1$ must be combined to obtain pseudorandom sequences with
good statistical properties. Such combined generators are formed by applying
the six-step process above to (usually three) different unary ⊕-NFAs, and taking
the symmetric difference of the different bit sequences after the fourth step.

To randomly generate an n-state *-NFA with k alphabet symbols, we group
the bit stream into blocks of size $kn^2$. Each block is interpreted as the transition
table of a *-NFA with k alphabet symbols. The set of start states and set of final
states are generated by two other independent streams, each with block size n.
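
A block of $kn^2$ bits can be turned into a transition table as in the sketch below. The bit layout (symbol first, then source state, then target state) and the function name `decode_block` are assumptions made for illustration; the text does not specify the order in which MERLin reads the bits.

```python
def decode_block(bits, n, k):
    """Interpret k*n*n bits as the transition table of an n-state *-NFA.

    Assumed layout: for symbol s and source state q, the n bits starting at
    offset (s*n + q)*n mark which target states belong to delta(q, s).
    """
    if len(bits) != k * n * n:
        raise ValueError("block must contain exactly k*n*n bits")
    table = {}
    for s in range(k):
        for q in range(n):
            offset = (s * n + q) * n
            table[(q, s)] = frozenset(
                t for t in range(n) if bits[offset + t]
            )
    return table


# A unary (k = 1) two-state example.
table = decode_block([1, 1, 0, 1], n=2, k=1)
print(table)
```

Because every bit pattern is a valid table, no block has to be rejected, which keeps the pseudo-random stream intact.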

The random generation of an n-state *-NFA includes disconnected n-state
*-NFAs. Previous work overcame this problem by manually connecting a
generated *-NFA; another approach may be to discard disconnected *-NFAs.
Both these approaches change the original pseudo-random bitstream, and
therefore compromise the integrity of the random objects. In MERLin we overcome
the problem by interpreting the results of experiments on randomly generated
n-state *-NFAs as results on *-NFAs with n or fewer states (that is, connected
and disconnected n-state *-NFAs).

Our method of randomly generating *-NFAs is based on a mapping from
the transition tables of the *-NFAs to numbers: the block of size $kn^2$ represents
one number in the pseudo-random sequence. However, this does not guarantee
in any way that the stream of generated *-NFAs is random over the domain
of the regular languages. We are currently investigating methods to map the
pseudo-random *-NFA stream to an enumeration of the regular languages in
order to test the quality of the pseudo-random *-NFA stream over this domain.

5 Conclusion and Future Work 

We described a Modelling Environment for Regular Languages (MERLin), which 
can be used to conduct experiments on finite automata and regular languages. 
We illustrated the use of the MERLin system in comparing the typical descrip- 
tional complexity behaviour of different types of finite machines. 

We are currently extending the MERLin software, with: A random number 
generator test suite; a cellular automaton class in Grail; and a visual display 
component for automaton execution trees. 
1 Introduction

In previous articles we presented PC-based educational software on lexical
analysis and semantic analysis. These systems were developed using an
authoring system under MS Windows. In them, the user cannot change the regular
expressions or the input words for the nondeterministic and deterministic finite
automata. To overcome these restrictions we developed GaniFA, our Java applet
for the visualization of algorithms from automata theory. It can be downloaded from
our web page http://www.cs.uni-sb.de/GANIMAL as a JAR file and requires
Java Plug-In 1.2. We invite readers to use it in their own web-based exercises,
lecture notes or presentations on finite automata. Furthermore, this web page
gives a short overview of how the applet can be customized and embedded into
HTML web pages.

2 GaniFA 

The GaniFA applet visualizes and animates the following algorithms:

— Generation of a nondeterministic finite automaton (NFA) from a regular
expression (RE), see Figure 1.

— Removal of ε-transitions from an NFA.

— Transformation of an NFA without ε-transitions into a deterministic finite
automaton (DFA).

— Minimization of a deterministic finite automaton (minDFA).

— For each of the automata generated above, the applet can visualize
the computation of the automaton on an input word.

GaniFA is customizable through a large set of parameters. In particular, it is 
possible to visualize only some of the algorithms and to pass a finite automaton 
or a regular expression as well as an input word to the applet. The GaniFA applet 
was embedded into an electronic textbook on the theory of finite automata, which 
can be studied with the help of a usual web browser like Netscape Communicator
or MS Internet Explorer. Currently there are English and German versions of
the textbook and of the applet itself.

S. Yu and A. Paun (Eds.): CIAA 2000, LNCS 2088, pp. 327- t?^ 2001. 

(c) Springer- Verlag Berlin Heidelberg 2001 
Fig. 1. Layout of the intermediate and final NFA for the RE (a|b)*.



3 Conclusion 

Although GaniFA and our electronic textbook only cover a small part of the 
theory on generating finite automata, they can be very useful for introductory 
courses. They provide a new way to access the material and allow for explorative, 
self-controlled learning. Teachers can not only use our textbook as it is, but
they can also embed GaniFA in their own lecture notes and exercises. As part of
our future work, we plan to use the technical framework underlying GaniFA and
GANIMAM to implement customizable, interactive web-based visualizations
of other computational models.
Treebag is a system that allows one to generate and transform objects of several
types. This is accomplished by generating and transforming trees using tree
grammars and tree transducers, and interpreting the resulting trees as expres- 
sions that denote objects of the desired type. For this, algebras are used, each 
algebra defining an abstract data type (i.e., a set of objects together with a 
number of operations on them). Finally, there are displays whose purpose is to 
visualise the generated objects. Tree grammars, tree transducers, algebras, and 
displays are the four types of Treebag components everything relies on. 

In order to explain how Treebag works, some basic concepts are required.
By a tree we mean a rooted and ordered tree with node labels taken from a
ranked alphabet (or signature). A node which is labelled with a symbol of rank
$n$ must have exactly $n$ children. For example, the signature
$\Sigma = \{+\colon 2, \mathit{fac}\colon 1\} \cup \{c\colon 0 \mid c \in \mathbb{N}\}$
contains the symbols $+$ and $\mathit{fac}$ of rank 2 and 1, respectively, and all
natural numbers, each considered as a symbol of rank 0. One of the trees over
this signature is shown on the right. Seen as a term in the
natural way, it would be denoted $+[11, \mathit{fac}[+[3, 2]]]$ or, using
infix notation for the binary symbol $+$ in order to enhance
readability, $11 + \mathit{fac}[3 + 2]$.

An algebra interprets every symbol of a signature as an
operation on the domain of that algebra (where arities of operations and ranks
of symbols coincide). Thus, every tree may be considered as an expression that
denotes an element of the domain. Taking the signature $\Sigma$ from above as an
example, one may choose the domain $\mathbb{N}$ and interpret $+$ as addition,
$\mathit{fac}$ as the factorial function, and $c \in \mathbb{N}$ as $c$. Then the
tree depicted above denotes the number 131. It is important to notice that the
signatures and trees themselves do not have a particular meaning: one can as well
consider a totally different algebra to interpret the symbols in $\Sigma$. A tree
as such is pure syntax without meaning.
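
The separation between a tree and its interpretation can be shown in a few lines. The sketch below is not Treebag code (Treebag itself is written in Java); the tuple representation of trees and the names `nat_algebra` and `term_algebra` are assumptions made for this example. The same tree is evaluated once in the algebra over the natural numbers and once in an algebra that merely prints the term.

```python
import math


def leaf(symbol):
    """A node of rank 0."""
    return (symbol, [])


# The tree +[11, fac[+[3, 2]]] as nested (symbol, children) pairs.
tree = ("+", [leaf(11), ("fac", [("+", [leaf(3), leaf(2)])])])


def nat_algebra(symbol, args):
    """Interpret + as addition, fac as factorial, and a number c as itself."""
    if symbol == "+":
        return args[0] + args[1]
    if symbol == "fac":
        return math.factorial(args[0])
    return symbol


def term_algebra(symbol, args):
    """Interpret every symbol as the construction of its own term notation."""
    if not args:
        return str(symbol)
    return f"{symbol}[{', '.join(args)}]"


def evaluate(node, algebra):
    """Evaluate a tree bottom-up in the given algebra."""
    symbol, children = node
    return algebra(symbol, [evaluate(child, algebra) for child in children])


print(evaluate(tree, nat_algebra))
print(evaluate(tree, term_algebra))
```

Only the `algebra` argument changes between the two calls, which is the sense in which the tree itself carries no meaning.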

A tree grammar is any device that generates a language of trees. A tree 
transducer transforms input trees into output trees according to some (possibly 

* This research was partially supported by the EC TMR Network GETGRATS, the 
ESPRIT Basic Research Working Group APPLIGRAPH, and the Deutsche For- 
schungsgesellschaft (DFG) under grant no. Kr-964/6-1 
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complex) rule. Now, if one interprets the output trees of a tree grammar or 
tree transducer by means of an appropriate algebra, objects from the respective 
domain are obtained. Thus, one can deal with all kinds of objects in a tree- 
oriented way just by choosing a suitable algebra to interpret the trees. Many 
of the well-known systems studied in formal language theory can be simulated 
nicely in this way. For a more detailed and formal discussion see, e.g.,
[Dre98a, Dre98b].

Treebag allows one to arrange instances of the four types of components as
nodes of an acyclic graph and to establish input/output relations between them. 
More precisely, the output of a tree grammar or tree transducer can be fed into 
tree transducers and displays. Furthermore, in order to make a display work, 
an algebra must be associated with it. Then the display will interpret its input 
trees accordingly and visualise the resulting objects. 

The system is implemented in pure Java, and it must be stressed that each of 
the mentioned types of Treebag components consists of several classes. There 
is, for example, one class that implements top-down tree transducers and another 
one that implements the so-called YIELD transduction. Both are tree transduc- 
tions, but of a different type. In fact, due to modularity one can easily add new 
types of tree transducers, simply by implementing a corresponding class. 

Every class of Treebag components defines its own syntax. This makes it 
possible to load an instance of the class — a concrete regular tree grammar, for 
example — from a file. Furthermore, every class provides a set of commands for 
interaction. So far, the following Treebag components are available: 

— regular, ET0L, and parallel deterministic total tree grammars;

— top-down tree transducers, the YIELD transduction, and a meta-class of tree 
transducers called iterator; 

— algebras on truth values, integers, strings, trees, and two-dimensional col- 
lages, as well as algebras that correspond to the chain-code and turtle for- 
malisms, yielding line drawings; 

— displays for a textual representation of objects (which can be used to display 
truth values, numbers, strings, and trees), for a graphical representation of 
trees, and for collages and line drawings. 

As arbitrary compositions of tree grammars and tree transducers can be built, 
the availability of these basic classes opens up a large number of possibilities. 

The newest version of Treebag, including examples ranging from a prime 
test on natural numbers to the generation of Celtic knotwork, is available at 
http://www.informatik.uni-bremen.de/~drewes/treebag.
Abstract. A compression method (WRAC) based on finite automata
is presented in this paper. A simple algorithm for constructing a finite
automaton for a given regular expression is shown. The main advantage
of this algorithm is the possibility of random access to the compressed
text. The compression ratio achieved is fairly good. The method is
independent of the source alphabet, i.e. the algorithm can be character-based
or word-based.

Keywords: word-based compression, text databases, information
retrieval, HuffWord, WLZW



1 Random Access Compression 

Let $A = \{a_1, a_2, \ldots, a_n\}$ be an alphabet. A document $D$ of length $m$
can be written as a sequence $D = d_0, d_1, \ldots, d_{m-1}$, where $d_i \in A$.
For each position $i$ we are able to find out which symbol $d_i$ is at this
position. We must preserve this property to create a compressed document with
random access.

The set of positions $\{i;\ 0 \leq i < m\}$ can be written as a set of binary
words $\{b_i\}$ of fixed length. This set can be considered as a language $L(D)$
over the alphabet $\{0, 1\}$. It can easily be shown that the language $L(D)$ is
regular ($L(D)$ is finite) and that it is possible to construct a DFA which
accepts the language $L(D)$. This DFA can be created, for example, by a known
algorithm for constructing a DFA from a regular expression. The regular
expression is formed as $b_0 + b_1 + \cdots + b_{m-1}$.

Compression of the document $D$ consists in creating a corresponding DFA.
But decompression is impossible. The DFA for the document $D$ can only decide
whether a binary word $b_i$ belongs to the language $L(D)$ or not. The DFA does not
say anything about the symbol which appears at position $i$. In order to do this,
the definition of a DFA must be extended.

Definition 1. A deterministic finite automaton with output (DFAO) is a
7-tuple $(Q, A, B, \delta, \sigma, q_0, F)$, where $Q$ is a finite set of states,
$A$ is a finite set of input symbols (input alphabet), $B$ is a finite set of
output symbols (output alphabet), $\delta$ is a state transition function
$Q \times A \to Q$, $q_0$ is the initial state, $\sigma$ is an output function
$F \to B$, and $F \subseteq Q$ is the set of final states.

* This work was done under grant from the Grant Agency of Czech Republic, Prague 
No.: 201/00/1031 
This type of automaton is able to determine, for each accepted word
$b_i$, which symbol lies at position $i$. To create an automaton of this type, the
construction algorithm mentioned above must be extended too. The regular
expression $V$, which is the input to the algorithm, consists of the words $b_i$.
Each $b_i$ must carry its output symbol $d_i$. The regular expression is now
formed as $b_0 d_0 + b_1 d_1 + \cdots + b_{m-1} d_{m-1}$.

The set of states $Q$ of the automaton $DFAO(V)$ is divided into disjoint
subsets (so-called layers). Transitions occur only between two adjacent layers.
Thus states can be numbered locally within their layer. The final automaton is
stored on disk after construction. All layers are stored sequentially.

Note that the construction algorithm is independent of the output alphabet.
There are two possibilities. The first is the classic character-based version:
the algorithm is one-pass and the output alphabet is standard ASCII. For text
retrieval systems the word-based version (the second
possibility) is more advantageous because of the character of natural languages.
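
The idea of a DFAO that maps a binary position word to the symbol stored at that position can be sketched as follows. This is a simplified illustration, not the WRAC implementation: the automaton is built as an unminimized trie, so it shows random access but not compression, and the names `build_dfao` and `symbol_at` are hypothetical. Each trie depth corresponds to one layer in the sense described above.

```python
def build_dfao(document):
    """Build (delta, sigma, width) for the words b_i with outputs d_i.

    States are integers with 0 as the initial state; delta maps
    (state, bit) to a state and sigma maps final states to output symbols.
    """
    width = max(1, (len(document) - 1).bit_length())
    delta, sigma = {}, {}
    next_state = 1
    for i, symbol in enumerate(document):
        state = 0
        for bit in format(i, f"0{width}b"):  # fixed-length binary word b_i
            key = (state, bit)
            if key not in delta:
                delta[key] = next_state
                next_state += 1
            state = delta[key]
        sigma[state] = symbol  # output function on the final state
    return delta, sigma, width


def symbol_at(dfao, i):
    """Run the automaton on b_i and return its output, or None if rejected."""
    delta, sigma, width = dfao
    state = 0
    for bit in format(i, f"0{width}b"):
        state = delta.get((state, bit))
        if state is None:
            return None
    return sigma.get(state)


# Word-based example: the output alphabet is a set of words.
dfao = build_dfao(["to", "be", "or", "not", "to", "be"])
print(symbol_at(dfao, 3))
print(symbol_at(dfao, 6))
```

In this sketch a lookup reads only the bits of the position word, so the cost of accessing one symbol does not depend on how much of the document precedes it.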

2 Experimental Results 

To allow a practical comparison of the algorithm, experiments were performed on
a compression corpus. The Canterbury Compression Corpus (large files) was used
for the test, in particular the King James Bible (bible.txt) file, which is
4,077,774 bytes long. There are 153, 5710 tokens and 13,461 of them are distinct.
The word-based version of the algorithm was used for the test.

The implementation of this method is described in [2].



Table 1. Comparison with other compression utilities

Compression utility       Compressed text [bytes]   Ratio [%]
WRAC                      1480884                   36.3
ARJ 2.41a                 1207114                   29.6
WINZIP 6.2                1178869                   28.9
GZip (UNIX)               1178757                   28.9
WinRAR 2.0                994346                    24.4
WLZW (word-based LZW)     896956                    22.0
Sequential transducers, introduced by Schützenberger, have advantageous
computational properties. A sequential transducer is deterministic with respect
to its input. Not all transducers can be sequentialized, but if one can be, this
yields time and, often, space optimality. This article extends the
subsequentialization algorithm of Mohri to previously untreated classes of
transducers. We

— change the representation of final p-strings, 

— extend the sequentialization to input ε labels and their closures,

— handle the unknown symbol. 

Mohri uses final p-strings to express p-subsequentiality. We convert them to 
real arcs and states to have a more uniform representation and to maintain the 
two-sided applicability of the transducer. This change is of linear complexity. 

An ε-closure set and appropriate modifications in the subsequentialization
algorithm of Mohri make it possible to handle transducers containing input-side
ε labels. This does not require any intermediate transformation of the transducer.




Fig. 1. The left-hand transducer has all possible ambiguities of transducing a string of
2 symbols to another string of 2 symbols (AB to/from ab). Its equivalent, by extended
ε-closure sequentialization, is a linear one (on the right). The left network has 11 paths;
10 of them are spurious. The states of the networks are annotated with the number of
paths leading from the given state to a final state.
Our ε-closure modification solves a complexity problem in subsequentiable
transducers that contain arcs with an ε label either on the input or on
the output side. An illustration of such cases is the transduction of a string
of length $n$ to another string of the same size (Fig. 1). Such a transducer can
be ambiguous according to a rapidly growing function in $n$; a (modest) lower
bound for the number of possible paths is $O(3^n)$, that is, such a transducer
has an exponential number of ambiguous paths expressing the same mapping. If such
a transducer is Kleene-starred (allowing repetitions of the input string), $k$
repetitions will require more than $O((3^n)^k)$ recognition complexity. We only
know a recursive form for the number of possible paths in the general case, but
lower bound approximations show that such cases rapidly become intractable.
Such transducers can, however, be transformed by ε-sequentialization, making
recognition complexity linear, $O(nk)$.

By using the ε-closure, the ambiguities stemming from input ε transitions
can be handled, and both ordinary and ε-ambiguities are (sub)sequentialized in
the same step, by local modifications. This extension does not increase the
complexity of the original algorithm. The ε-closure operation is a semiring
operation creating a set of directed acyclic graphs. Such ε-ambiguities may and do
arise in finite-state compilers and tools.

The unknown symbol is an extension of the usual transducer notation to
define special treatment for input symbols not in the input alphabet of the
transducer. It can be present, if handled specially, in sequentialization and
in subsequent finite-state calculus operations and applications. The solution is
local at sequentialization time, with no additional complexity, and needs a
run-time queue of bounded size; this solution has been reused in another
finite-state implementation.

The beneficial effects of these transformations have been used in real
natural language processing cases. They confirm the practical results of Mohri:
a typical efficiency improvement of 14-40% was measured after subsequentializing
lexical transducers created by Xerox finite-state tools.
Abstract. Finite state transductions have been shown to be quite useful
in a number of areas; however, it is still often difficult
to express certain kinds of transductions without resorting to a state and
transition view. INR was developed to explore this problem, and several
applications of transduction were studied as exercises in specification
during INR's development. The specification of the NYSIIS phonetic
encoding function (developed for the New York State Identification and
Intelligence System) provides a clear example of many important ideas.
An INR specification for NYSIIS is provided, which is syntactically
similar to the prose description and from which INR can directly produce
the 149 state subsequential transducer.



Phonetic encoding of names has been used to support search in large
databases based on surnames which may have been misspelled. The best known
of these encodings is Soundex, but it has been found to have many defects for
common types of names. The NYSIIS encoding function was designed as a replacement
to improve the handling of Spanish and southern European surnames.

A prose description of NYSIIS has been converted to INR syntax (see
Figure 1), resulting in a specification that mirrors the structure of the original
description and yet can be directly converted to a subsequential transducer and
compiled into a highly optimized subroutine. This facilitates greater flexibility
in the design of phonetic encoding functions while preserving implementation
efficiency. Adding the capability for representing the features of NYSIIS to INR
has introduced many new operators and constructions, described fully in the
longer version of this paper.
Vowel = { A, E, I, O, U };
Consonant = { B, C, D, F, G, H, J, K, L, M, N,
              P, Q, R, S, T, V, W, X, Y, Z };
Letter = Consonant | Vowel;
Copy1 = Letter $ (0,0);
Copy = Copy1*;
BeforeEqualsAfter = ( Copy1 Copy1 ) @ ( Letter $ (0 0) );
SetToPreceding = ( Copy1 ( ) ( Letter, Letter ) )
                 @@ BeforeEqualsAfter;

Ny1 = ( 'MAC', 'MCC' ) Copy
   || ( 'KN', 'NN' ) Copy
   || ( 'K', 'C' ) Copy
   || ( 'PH', 'FF' ) Copy
   || ( 'PF', 'FF' ) Copy
   || ( 'SCH', 'SSS' ) Copy
   || Copy;
Ny2 = Copy ( 'EE', 'Y' )
   || Copy ( 'IE', 'Y' )
   || Copy ( { 'DT', 'RT', 'RD', 'NT', 'ND' ...
   || Copy;
Ny34 = Copy1 [ ... Copy;
Ny5a = Copy ( ... )
    || Copy ( '_EV', '_AF' ) Copy
    || Copy ( '_' Vowel, '_A' ) Copy;
Ny5b = Copy ( '_Q', '_G' ) Copy
    || Copy ( '_Z', '_S' ) Copy
    || Copy ( '_M', '_N' ) Copy;
Ny5c = Copy ( '_KN', '_NN' ) Copy
    || Copy ( '_K', '_C' ) Copy;
Ny5d = Copy ( '_SCH', '_SSS' ) Copy
    || Copy ( '_PH', '_FF' ) Copy;
Ny5e = ( Letter* ( Consonant '_H' | '_H' Consonant ) Letter* )
       @@ ( Copy SetToPreceding Copy );
Ny5f = ( Letter* Vowel '_W' Letter* ) @@ ( Copy SetToPreceding Copy );
Ny5 = Ny5a || Ny5b || Ny5c || Ny5d || Ny5e || Ny5f
   || Copy ( ) Copy;
Ny6 = ( ( Letter* BeforeEqualsAfter Letter* )
        @@ ( Copy Letter Copy ) )
   || ( Copy Copy1 Copy );
Ny3456 = Ny34 @ ( Ny5 @ Ny6 :clsseq );
Ny7 = Copy ( 'S', ^ ) || Copy;
Ny8 = Copy ( 'AY', 'Y' ) || Copy;
Ny9 = Copy ( 'A', ^ ) || Copy;
NYSIIS = Ny1 @ Ny2 @ Ny3456 @ Ny7 @ Ny8 @ Ny9 :sseq;



Fig. 1. NYSIIS as an INR specification 
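
The first two stages of the specification in Fig. 1, Ny1 (prefix rewriting) and Ny2 (suffix rewriting), can be imitated in ordinary code. The Python sketch below is only an illustration and not the transducer that INR produces. It assumes that `||` acts as a prioritized choice (the first alternative that applies wins) and that the 'DT', 'RT', 'RD', 'NT', 'ND' endings are rewritten to 'D', as in the usual prose description of NYSIIS. The function names are hypothetical.

```python
# Rules in priority order, mirroring the alternatives of Ny1 and Ny2.
PREFIX_RULES = [("MAC", "MCC"), ("KN", "NN"), ("K", "C"),
                ("PH", "FF"), ("PF", "FF"), ("SCH", "SSS")]
# Assumption: the five two-letter endings all map to "D".
SUFFIX_RULES = [("EE", "Y"), ("IE", "Y"), ("DT", "D"), ("RT", "D"),
                ("RD", "D"), ("NT", "D"), ("ND", "D")]


def rewrite_prefix(name, rules):
    """Apply the first matching prefix rule; otherwise copy the name."""
    for old, new in rules:
        if name.startswith(old):
            return new + name[len(old):]
    return name


def rewrite_suffix(name, rules):
    """Apply the first matching suffix rule; otherwise copy the name."""
    for old, new in rules:
        if name.endswith(old):
            return name[:len(name) - len(old)] + new
    return name


def ny1_ny2(name):
    """Compose the two stages, as 'Ny1 @ Ny2' does in the specification."""
    return rewrite_suffix(rewrite_prefix(name, PREFIX_RULES), SUFFIX_RULES)
```

For example, `ny1_ny2("SCHMIDT")` first rewrites the prefix 'SCH' to 'SSS' and then the suffix 'DT' to 'D'. Listing 'KN' before 'K' is what lets the longer prefix take priority.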
1 Introduction

We present a method of constructing and using a cascade consisting of a left-
and a right-sequential finite-state transducer (FST), T1 and T2, for part-of-speech
(POS) disambiguation. Compared to a Hidden Markov model (HMM), this FST
cascade has the advantage of significantly higher processing speed, but at the cost
of slightly lower accuracy. Applications such as Information Retrieval, where
speed can be more important than accuracy, could benefit from this approach.

In the process of POS tagging, we first assign every word of a sentence a
unique ambiguity class c_i that can be looked up in a lexicon encoded by a
sequential FST. Every c_i is denoted by a single symbol, e.g. “[ADJ NOUN]”, although
it represents a set of alternative tags that a given word can occur with. The
sequence of the c_i of all words of one sentence is the input to our FST cascade
(Fig. 1). It is mapped by T1, from left to right, to a sequence of reduced
ambiguity classes r_i. Every r_i is denoted by a single symbol, although it represents
a set of alternative tags. Intuitively, T1 eliminates the less likely tags from c_i,
thus creating r_i. Finally, T2 maps the sequence of r_i, from right to left, to an
output sequence of single POS tags t_i. Intuitively, T2 selects the most likely t_i
from every r_i (Fig. 1).

Although our approach is related to the concept of bimachines [2] and
factorization P, we proceed differently in that we build two sequential FSTs directly
and not by factorization. 

. . . [DET RELPRO] [ADJ NOUN] [ADJ NOUN VERB] [VERB] . . .
          T1 maps left to right
. . . [DET RELPRO] [ADJ] [ADJ NOUN] [VERB] . . .
          T2 maps right to left
. . . DET ADJ NOUN VERB . . .

Fig. 1. Input, intermediate, and output sequence



2 Construction of the FSTs 

In T1, one state is created for every r_i (output symbol), and is labeled with this
r_i (Fig. 2a). An initial state, not corresponding to any r_i, is created in addition.
From every state, one outgoing arc is created for every c_i (input symbol), and is
labeled with this c_i. The destination of every arc is the state of the most likely
r_i in the context of both the current c_i (arc label) and the preceding r_{i-1} (source
state label). This most likely r_i is estimated from the transition and emission
probabilities of the different r_i and c_i. Then, all arc labels are changed from
simple symbols c_i to symbol pairs c_i:r_i (mapping c_i to r_i) that consist of the
original arc label and the destination state label. All state labels are removed
(Fig. 2b). Those r_i that are unlikely in any context disappear, after minimization,
from T1. T1 accepts any sequence of c_i and maps it, from left to right, to the
sequence of the most likely r_i in the given left context.
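
The construction of T1 described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the text does not say how the candidate reduced classes for a given class are enumerated or how the probabilities are combined, so the `candidates` table, the `score` function, and all numbers below are hypothetical placeholders. Minimization is not shown.

```python
def build_t1(classes, candidates, score, start="START"):
    """Build the transition table of T1.

    States are labelled with reduced classes r (plus one initial state).
    delta[(state, c)] = (r, next_state): reading class c in `state`
    emits the best-scoring r and moves to the state labelled r.
    """
    states = [start] + sorted({r for c in classes for r in candidates[c]})
    delta = {}
    for state in states:
        for c in classes:
            # Most likely r given the current c and the source state label.
            best = max(candidates[c], key=lambda r: score(state, c, r))
            delta[(state, c)] = (best, best)
    return delta


# Hypothetical toy data, for illustration only.
CANDIDATES = {
    "[DET RELPRO]": ["[DET RELPRO]", "[DET]", "[RELPRO]"],
    "[ADJ NOUN]": ["[ADJ NOUN]", "[ADJ]", "[NOUN]"],
}
TOY_SCORES = {
    ("START", "[DET RELPRO]", "[DET RELPRO]"): 0.7,
    ("[DET RELPRO]", "[ADJ NOUN]", "[ADJ]"): 0.6,
}


def toy_score(prev_r, c, r):
    """Stand-in for an estimate from transition and emission probabilities."""
    return TOY_SCORES.get((prev_r, c, r), 0.1)


T1 = build_t1(list(CANDIDATES), CANDIDATES, toy_score)
```

Storing the output symbol on the arc corresponds to relabelling the arcs with pairs c_i:r_i. The state label is kept here only as a dictionary key and plays no role once the table is built.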




Fig. 2. Construction of T1        Fig. 3. Construction of T2

In T2, one state is created for every t_i (output symbol), and is labeled with
this t_i (Fig. 3a). An initial state is added. From every state, one outgoing arc
is created for every r_i (input symbol) that occurs in the output language of T1,
and is labeled with this r_i. The destination of every arc is the state of the most
likely t_i in the context of both the current r_i (arc label) and the following t_{i+1}
(source state label). Note that this is the following tag, rather than the preceding one,
because T2 will be applied from right to left. The most likely t_i is estimated
from the transition and emission probabilities of the different t_i and r_i. Then,
all arc labels are changed into symbol pairs r_i:t_i and all state labels are removed
(Fig. 3b), as was done in T1. T2 accepts any sequence of r_i generated by T1 and
maps it, from right to left, to the sequence of the most likely t_i in the given right
context.

Both T1 and T2 are sequential. They can be minimized with standard
algorithms. Once T1 and T2 are built, the transition and emission probabilities of all
t_i, r_i, and c_i are of no further use. Probabilities do not (directly) occur in the
FSTs, and are not (directly) used at run time. They are, however, “implicitly
contained” in the structure of the FSTs.
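
How the finished cascade is applied at run time can be sketched as follows. The Python below is an illustrative sketch, assuming each sequential FST is stored as a table from (state, input symbol) to (output symbol, next state). The two toy tables are hypothetical and contain only the arcs needed for the sequence in Fig. 1.

```python
def run_sequential(delta, start, symbols):
    """Apply a sequential FST: one table lookup per input symbol."""
    state, output = start, []
    for symbol in symbols:
        emitted, state = delta[(state, symbol)]
        output.append(emitted)
    return output


def tag(classes, t1, t2, start="START"):
    """Map ambiguity classes to tags with the two-FST cascade."""
    reduced = run_sequential(t1, start, classes)             # left to right
    reversed_tags = run_sequential(t2, start, reduced[::-1])  # right to left
    return reversed_tags[::-1]


# Hypothetical arcs covering only the example of Fig. 1.
TOY_T1 = {
    ("START", "[DET RELPRO]"): ("[DET RELPRO]", "[DET RELPRO]"),
    ("[DET RELPRO]", "[ADJ NOUN]"): ("[ADJ]", "[ADJ]"),
    ("[ADJ]", "[ADJ NOUN VERB]"): ("[ADJ NOUN]", "[ADJ NOUN]"),
    ("[ADJ NOUN]", "[VERB]"): ("[VERB]", "[VERB]"),
}
TOY_T2 = {
    ("START", "[VERB]"): ("VERB", "VERB"),
    ("VERB", "[ADJ NOUN]"): ("NOUN", "NOUN"),
    ("NOUN", "[ADJ]"): ("ADJ", "ADJ"),
    ("ADJ", "[DET RELPRO]"): ("DET", "DET"),
}

SENTENCE = ["[DET RELPRO]", "[ADJ NOUN]", "[ADJ NOUN VERB]", "[VERB]"]
TAGS = tag(SENTENCE, TOY_T1, TOY_T2)
```

No probabilities appear in this loop, which is consistent with the remark that they are only implicitly contained in the structure of the FSTs.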

3 Results 

We compared our FST tagger on 3 languages (English, German, Spanish) with 
a commercially available HMM tagger. The FST tagger was on average 10 times 
as fast but slightly less accurate than the HMM tagger (45 600 words/sec and 
96.97% versus 4 360 words/sec and 97.43%). In some applications such as In- 
formation Retrieval a significant speed increase can be worth the small loss in 
accuracy. 
Abstract. Adaptive technologies [1] are based on the self-modifying
property of some systems, which gives their users a powerful facility for
expressing and handling complex problems. One may turn a rule-based
formalism into a corresponding adaptive one by attaching adaptive
actions to its rules. This work focuses on adaptive automata, an
adaptive formalism based on structured pushdown automata. Their transitions
may hold adaptive actions responsible for self-modifications. An example
illustrates an adaptive-automata-based solution to a problem
involving the copy language, an interesting context-dependent language.



Structured pushdown automata are state machines composed of a set of
finite-state-like, mutually recursive submachines. In adaptive automata [2], [3],
adaptive actions may be attached to their rules. (γg, e, sα), A : → (γg', e', s'α), B
represents the general form of a rule in an adaptive automaton. Its left-hand
side refers to the configuration of the automaton before the state transition,
whereas the right-hand side encodes its configuration after it. The components of the
3-tuples encode the situation of the pushdown store, the state, and the input
data, respectively. Adaptive actions (A and B) are optional: the left
one represents modifications to be applied before the state transition, while the
right one specifies the changes to be imposed on the automaton after the
transition. Adaptive actions are calls to parametric adaptive functions representing
collections of elementary adaptive actions to be applied to the transition set
of the automaton. Three elementary adaptive actions are allowed: inspection,
deletion, and insertion of transitions. ⊗ [ (γg, e, sα), A : → (γg', e', s'α), B ]
denotes any elementary adaptive action, obtained by replacing the operator ⊗ by ? for
inspection, + for insertion, and - for deletion of transitions having
the shape specified in brackets. Adaptive automata are Turing-powerful, so they
can handle context-sensitive languages.
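
The three elementary adaptive actions can be pictured as operations on a set of transitions. The Python below is a simplified sketch, not the formalism itself: a transition is reduced to a hypothetical triple (state, consumed symbol, next state), the pushdown store is ignored, and `None` in a pattern stands for an unbound variable. All function names are illustrative.

```python
def matches(pattern, transition):
    """A pattern component of None matches anything."""
    return all(p is None or p == t for p, t in zip(pattern, transition))


def inspect_transitions(transitions, pattern):
    """The '?' action: list the transitions of the given shape."""
    return sorted(t for t in transitions if matches(pattern, t))


def delete_transitions(transitions, pattern):
    """The '-' action: remove the transitions of the given shape."""
    for t in inspect_transitions(transitions, pattern):
        transitions.discard(t)


def insert_transition(transitions, transition):
    """The '+' action: add one fully specified transition."""
    transitions.add(transition)


def redirect_pointer(transitions, pointer, new_target):
    """Combine ?, - and + to make an empty transition point elsewhere."""
    old = inspect_transitions(transitions, (pointer, "", None))
    delete_transitions(transitions, (pointer, "", None))
    insert_transition(transitions, (pointer, "", new_target))
    return [target for (_, _, target) in old]
```

For instance, with `rules = {("y", "", "e6")}`, calling `redirect_pointer(rules, "y", "t1")` returns the old target and leaves `y` pointing to `t1`. An inspect, delete, insert sequence of this kind is how an adaptive function can keep a pointer transition up to date while it grows the automaton.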

Example. The following example illustrates the use of adaptive automata to
implement an acceptor for the copy language L = ww⊣ with w ∈ Σ*. This is neither
a regular nor a context-free language. It is useful in the processing of
reduplication, a general linguistic construct present in some Far Eastern natural languages,
in which some words are formed by concatenating two copies of another word.
One possible adaptive acceptor for this language has two submachines:
• The main submachine, responsible for the global structure of the
sentence, (1) reads the input string; (2) builds a section of the main submachine that
allows reprocessing the input string; (3) detects the end of the input string; (4)
transfers control to the extension built in item 2, where: (5) for each transition,
the attached symbol is stacked back onto the input string; (6) after restoring
the whole input string, the secondary submachine is called twice; (7) the input
string is accepted if and only if the final state is reached.

• An auxiliary submachine detects the occurrence of duplications within 
the sentence: (1) re-reads the first half of the input string when called the first 
time; (2) tries to match the second half of the sentence when called again. 
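
The strategy in the two items above can be imitated in ordinary code. The Python sketch below is one reading of that strategy, not a simulation of the adaptive automaton: a list stands in for the chain of transitions built while reading, a counter stands in for a pointer that advances once for every two symbols read, and the end-of-input marker is not modeled. All names are hypothetical.

```python
def accepts_copy(s):
    """Return True if and only if s has the form ww."""
    chain = []    # one entry per symbol read, like a chain of transitions
    middle = 0    # pointer that advances on every second symbol
    for index, symbol in enumerate(s):
        chain.append(symbol)
        if index % 2 == 1:
            middle += 1
    if len(s) % 2 == 1:
        return False  # an odd-length input cannot be split into two halves

    half = chain[:middle]  # cut the chain at the midpoint

    # Call the "auxiliary submachine" twice: first on the first half,
    # then on the second half of the restored input.
    position = 0
    for _ in range(2):
        for expected in half:
            if position >= len(s) or s[position] != expected:
                return False
            position += 1
    return position == len(s)
```

The first call always succeeds, since the chain was built from the input itself. The actual test is the second call, which checks the second half against the same chain.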

The listing below shows the full formalization of such an adaptive automaton. 



the main submachine
  mounting both string halves:
    (γ, e1, σα): → (γ, e2, α), A(σ)
    (γ, e2, σα): → (γ, e1, α), B(σ)
  prepare to re-read:
    (γ, e1, ⊣α): → (γ, e6, α), C( )
  re-consuming both halves:
    (γ, e3, α): → (γe4, e5, α)
    (γ, e4, α): → (γef, e5, α)
  final state:
    (γ, ef, α): → (γ, ef, α)

the auxiliary submachine (built by the adaptive actions)
  returns to calling submachine:
    (γe4, ef1, α): → (γ, e4, α)
    (γef, ef1, α): → (γ, ef, α)
  auxiliary (pointer) transitions:
    (γ, x, α): → (γ, e5, α)
    (γ, y, α): → (γ, e6, α)
    (γ, m, α): → (γ, e5, α)

adaptive actions (build the auxiliary submachine)
  A (σ): { u, z, t*, w*
    ? [ (γ, y, α): → (γ, z, α) ]
    + [ (γ, z, α): → (γ, t, σα) ]
    - [ (γ, y, α): → (γ, z, α) ]
    + [ (γ, y, α): → (γ, t, α) ]
    ? [ (γ, x, α): → (γ, u, α) ]
    + [ (γ, u, σα): → (γ, w, α) ]
    - [ (γ, x, α): → (γ, u, α) ]
    + [ (γ, x, α): → (γ, w, α) ]
  }
  B (σ): { v, r, s
    A (σ)
    ? [ (γ, m, α): → (γ, s, α) ]
    - [ (γ, m, α): → (γ, s, α) ]
    ? [ (γ, s, σα): → (γ, r, α) ]
    + [ (γ, m, α): → (γ, r, α) ]
  }
  C ( ): { n, t, p, v
    - [ (γ, n, vα): → (γ, p, α) ]
    ? [ (γ, m, α): → (γ, n, α) ]
    + [ (γ, n, α): → (γ, ef1, α) ]
    ? [ (γ, y, α): → (γ, t, α) ]
    + [ (γ, t, α): → (γ, e3, α) ]
  }



This example shows how adaptive automata may be employed to efficiently
represent solutions for complex problems. The behavior of adaptive automata
as piecewise finite-state or structured pushdown automata renders them easy to
understand and well suited as implementation models. Many other classical
problems are also handled effectively by adaptive automata, whose behavior
may often be far better, in both space and time, than that of the usual equivalents.

Such results encourage further efforts on the subject of this research.
Related work is in progress at our institution exploring adaptive automata as
programming paradigms, as computation models, as language implementation
models, etc. In the near future we expect to have a fully working high-level
adaptive-paradigm language system based exclusively on adaptive automata, starting from
its grammatical conception and including context-dependent aspects, the run-time
environment, and the semantics of its dynamic behavior.